Skip to content

Order-dependence

Narwhals has four main public classes:

  • Expr: this is what gets created when you write nw.col('a').
  • DataFrame: in-memory, eager dataframe with a well-defined row order which is preserved across with_columns and select operations.
  • LazyFrame: a dataframe which makes no assumptions about row-ordering. This allows it to be backed by SQL engines.
  • Series: 1-dimensional in-memory structure with a defined row order. This is what you get if you extract a single column from a DataFrame.

Row order is important to think about when performing operations which rely on it, such as:

  • diff, shift.
  • cum_sum, cum_min, ...
  • rolling_sum, rolling_min, ...
  • is_first_distinct, is_last_distinct.

When row-order is defined, as is the case for DataFrame, these operations pose no issue.

import narwhals as nw
import pandas as pd

data = {"a": [1, 3, 4], "i": [0, 1, 2]}
df = nw.from_native(pd.DataFrame(data))
print(df.with_columns(a_cum_sum=nw.col("a").cum_sum()))
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|   a  i  a_cum_sum|
|0  1  0          1|
|1  3  1          4|
|2  4  2          8|
└──────────────────┘

When row order is undefined however, then these operations do not have a defined result. To make them well-defined, you need to follow them with over in which you specify order_by. For example:

  • nw.col('a').cum_sum() can only be executed by a DataFrame.
  • nw.col('a').cum_sum().over(order_by="i") can only be executed by either a DataFrame or a LazyFrame.
import polars as pl

lf = nw.from_native(pl.LazyFrame(data))
result = lf.with_columns(a_cum_sum=nw.col("a").cum_sum().over(order_by="i"))
print(result.collect())
┌─────────────────────────┐
|   Narwhals DataFrame    |
|-------------------------|
|shape: (3, 3)            |
|┌─────┬─────┬───────────┐|
| a    i    a_cum_sum |
| ---  ---  ---       |
| i64  i64  i64       |
|╞═════╪═════╪═══════════╡|
| 1    0    1         |
| 3    1    4         |
| 4    2    8         |
|└─────┴─────┴───────────┘|
└─────────────────────────┘

When writing an order-dependent function, if you want it to be executable by LazyFrame (and not just DataFrame), make sure that all order-dependent expressions are followed by over with order_by specified. If you forget to, don't worry, Narwhals will give you a loud and clear error message.