Order-dependence
Narwhals has four main public classes:
Expr: what gets created when you write nw.col('a').
DataFrame: in-memory, eager dataframe with a well-defined row order which is preserved across with_columns and select operations.
LazyFrame: a dataframe which makes no assumptions about row-ordering. This allows it to be backed by SQL engines.
Series: 1-dimensional in-memory structure with a defined row order. This is what you get if you extract a single column from a DataFrame.
Row order is important to think about when performing operations which rely on it, such as:
diff, shift.
cum_sum, cum_min, ...
rolling_sum, rolling_min, ...
is_first_distinct, is_last_distinct.
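To see why row order matters for these operations, here is a minimal plain-Python sketch that does not use Narwhals at all. The row data and the helper names (cum_sum_of_a, cum_sum_of_a_ordered) are made up for illustration. It shows that a cumulative sum changes when the same rows arrive in a different order, and that sorting by an explicit key first makes the result independent of arrival order.

```python
from itertools import accumulate

# Hypothetical rows as (i, a) pairs; "i" plays the role of an ordering key.
rows = [(0, 1), (1, 3), (2, 4)]
shuffled = [(2, 4), (0, 1), (1, 3)]  # same rows, different arrival order


def cum_sum_of_a(rows):
    """Cumulative sum of "a" in whatever order the rows arrive."""
    return list(accumulate(a for _, a in rows))


def cum_sum_of_a_ordered(rows):
    """Cumulative sum of "a" after sorting by "i", keyed by each row's "i"."""
    ordered = sorted(rows, key=lambda row: row[0])
    totals = accumulate(a for _, a in ordered)
    return dict(zip((i for i, _ in ordered), totals))


print(cum_sum_of_a(rows))      # [1, 4, 8]
print(cum_sum_of_a(shuffled))  # [4, 5, 8]
# With an explicit ordering key, arrival order no longer matters.
print(cum_sum_of_a_ordered(rows) == cum_sum_of_a_ordered(shuffled))  # True
```

The second helper is a rough analogue of attaching an explicit ordering to an order-dependent operation: the result is tied to the key rather than to whatever order the rows happen to be stored in.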
When row order is defined, as is the case for DataFrame, these operations pose no issue.
import narwhals as nw
import pandas as pd
df_pd = pd.DataFrame({"a": [1, 3, 4], "i": [0, 1, 2]})
df = nw.from_native(df_pd)
print(df.with_columns(a_cum_sum=nw.col("a").cum_sum()))
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a i a_cum_sum|
|0 1 0 1|
|1 3 1 4|
|2 4 2 8|
└──────────────────┘
When row order is undefined, however, these operations do not have a defined result. To make them well-defined, you need to follow them with over, in which you specify order_by. For example:
nw.col('a').cum_sum() can only be executed by a DataFrame.
nw.col('a').cum_sum().over(order_by="i") can be executed by either a DataFrame or a LazyFrame.
from sqlframe.duckdb import DuckDBSession
session = DuckDBSession()
sqlframe_df = session.createDataFrame(df_pd)
lf = nw.from_native(sqlframe_df)
result = lf.with_columns(a_cum_sum=nw.col("a").cum_sum().over(order_by="i"))
print(result)
print(result.collect("pandas"))
┌────────────────────────────────────────────────────────────────────┐
| Narwhals LazyFrame |
|--------------------------------------------------------------------|
|<sqlframe.duckdb.dataframe.DuckDBDataFrame object at 0x7fa3216b1f30>|
└────────────────────────────────────────────────────────────────────┘
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
| a i a_cum_sum|
|0 1 0 1|
|1 3 1 4|
|2 4 2 8|
└──────────────────┘
When writing an order-dependent function, if you want it to be executable by LazyFrame (and not just DataFrame), make sure that all order-dependent expressions are followed by over with order_by specified. If you forget to, don't worry: Narwhals will give you a loud and clear error message.