narwhals.DataFrame
Narwhals DataFrame, backed by a native dataframe.
The native dataframe might be pandas.DataFrame, polars.DataFrame, ...
This class is not meant to be instantiated directly - instead, use narwhals.from_native.
columns: list[str]
property
Get column names.
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> df = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.columns
We can pass any supported library such as pandas, Polars, or PyArrow to func:
>>> func(df_pd)
['foo', 'bar', 'ham']
>>> func(df_pl)
['foo', 'bar', 'ham']
>>> func(df_pa)
['foo', 'bar', 'ham']
schema: Schema
property
Get an ordered mapping of column names to their data type.
Examples:
>>> import polars as pl
>>> import pandas as pd
>>> import narwhals as nw
>>> data = {
... "foo": [1, 2, 3],
... "bar": [6.0, 7.0, 8.0],
... "ham": ["a", "b", "c"],
... }
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.schema
You can pass either pandas or Polars to func:
>>> df_pd_schema = func(df_pd)
>>> df_pd_schema
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
>>> df_pl_schema = func(df_pl)
>>> df_pl_schema
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
shape: tuple[int, int]
property
Get the shape of the DataFrame.
Examples:
Construct pandas, Polars and PyArrow DataFrames:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> from narwhals.typing import IntoDataFrame
>>>
>>> df = {"foo": [1, 2, 3, 4, 5]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> def agnostic_shape(df_native: IntoDataFrame) -> tuple[int, int]:
...     df = nw.from_native(df_native)
...     return df.shape
We can then pass either pandas, Polars or PyArrow to agnostic_shape:
>>> agnostic_shape(df_pd)
(5, 1)
>>> agnostic_shape(df_pl)
(5, 1)
>>> agnostic_shape(df_pa)
(5, 1)
__arrow_c_stream__(requested_schema=None)
Export a DataFrame via the Arrow PyCapsule Interface.
- If the underlying dataframe implements the interface, it'll return that.
- Otherwise, it'll call to_arrow and then defer to PyArrow's implementation.
See PyCapsule Interface for more.
__getitem__(item)
__getitem__(item: tuple[Sequence[int], slice]) -> Self
__getitem__(
item: tuple[Sequence[int], Sequence[int]]
) -> Self
__getitem__(item: tuple[slice, Sequence[int]]) -> Self
__getitem__(item: tuple[Sequence[int], str]) -> Series
__getitem__(item: tuple[slice, str]) -> Series
__getitem__(
item: tuple[Sequence[int], Sequence[str]]
) -> Self
__getitem__(item: tuple[slice, Sequence[str]]) -> Self
__getitem__(item: tuple[Sequence[int], int]) -> Series
__getitem__(item: tuple[slice, int]) -> Series
__getitem__(item: Sequence[int]) -> Self
__getitem__(item: str) -> Series
__getitem__(item: Sequence[str]) -> Self
__getitem__(item: slice) -> Self
__getitem__(item: tuple[slice, slice]) -> Self
Extract column or slice of DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
item | str \| slice \| Sequence[int] \| Sequence[str] \| tuple[Sequence[int], str \| int] \| tuple[slice, str \| int] \| tuple[slice \| Sequence[int], Sequence[int] \| Sequence[str] \| slice] \| tuple[slice, slice] | How to slice dataframe. What happens depends on what is passed. It's easiest to explain by example. Suppose we have a Dataframe | required |
Notes
- Integers are always interpreted as positions.
- Strings are always interpreted as column names.
In contrast with Polars, pandas allows non-string column names. If you don't know whether the column name you're trying to extract is definitely a string (e.g. df[df.columns[0]]), then you should use DataFrame.get_column instead.
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> from narwhals.typing import IntoDataFrame
>>> from narwhals.typing import IntoSeries
>>>
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
>>> df_pa = pa.table(data)
We define a library agnostic function:
>>> def agnostic_slice(df_native: IntoDataFrame) -> IntoSeries:
...     df = nw.from_native(df_native)
...     return df["a"].to_native()
We can then pass either pandas, Polars or PyArrow to agnostic_slice:
>>> agnostic_slice(df_pd)
0 1
1 2
Name: a, dtype: int64
>>> agnostic_slice(df_pl)
shape: (2,)
Series: 'a' [i64]
[
1
2
]
>>> agnostic_slice(df_pa)
<pyarrow.lib.ChunkedArray object at ...>
[
[
1,
2
]
]
clone()
Create a copy of this DataFrame.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function in which we clone the DataFrame:
>>> @nw.narwhalify
... def func(df):
...     return df.clone()
>>> func(df_pd)
a b
0 1 3
1 2 4
>>> func(df_pl)
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
│ 2 ┆ 4 │
└─────┴─────┘
collect_schema()
Get an ordered mapping of column names to their data type.
Examples:
>>> import polars as pl
>>> import pandas as pd
>>> import narwhals as nw
>>> data = {
... "foo": [1, 2, 3],
... "bar": [6.0, 7.0, 8.0],
... "ham": ["a", "b", "c"],
... }
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.collect_schema()
You can pass either pandas or Polars to func:
>>> df_pd_schema = func(df_pd)
>>> df_pd_schema
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
>>> df_pl_schema = func(df_pl)
>>> df_pl_schema
Schema({'foo': Int64, 'bar': Float64, 'ham': String})
drop(*columns, strict=True)
Remove columns from the dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*columns | str \| Iterable[str] | Names of the columns that should be removed from the dataframe. | () |
strict | bool | Validate that all column names exist in the schema and throw an exception if a column name does not exist in the schema. | True |
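The effect of strict can be sketched in plain Python. This is only an illustration on a dict of lists, not Narwhals' implementation: the helper drop_columns is hypothetical, raising KeyError is an assumption (the table only says that an exception is thrown), and ignoring unknown names when strict=False is likewise an assumption.

```python
def drop_columns(frame, *columns, strict=True):
    """Return a copy of a dict-of-lists frame without the given columns."""
    missing = [name for name in columns if name not in frame]
    if strict and missing:
        # Assumed behaviour: fail loudly when a requested column is absent.
        raise KeyError(f"columns not found: {missing}")
    return {name: values for name, values in frame.items() if name not in columns}


frame = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}

# "spam" does not exist; with strict=False it is skipped in this sketch.
lenient = drop_columns(frame, "ham", "spam", strict=False)
print(list(lenient))

try:
    drop_columns(frame, "spam")
    strict_failed = False
except KeyError:
    strict_failed = True
print(strict_failed)
```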
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> data = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.drop("ham")
We can then pass either pandas or Polars to func:
>>> func(df_pd)
foo bar
0 1 6.0
1 2 7.0
2 3 8.0
>>> func(df_pl)
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1 ┆ 6.0 │
│ 2 ┆ 7.0 │
│ 3 ┆ 8.0 │
└─────┴─────┘
Use positional arguments to drop multiple columns.
>>> @nw.narwhalify
... def func(df):
...     return df.drop("foo", "ham")
>>> func(df_pd)
bar
0 6.0
1 7.0
2 8.0
>>> func(df_pl)
shape: (3, 1)
┌─────┐
│ bar │
│ --- │
│ f64 │
╞═════╡
│ 6.0 │
│ 7.0 │
│ 8.0 │
└─────┘
drop_nulls(subset=None)
Drop null values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subset | str \| list[str] \| None | Column name(s) for which null values are considered. If set to None (default), use all columns. | None |
Notes
pandas and Polars handle null values differently. Polars distinguishes between NaN and Null, whereas pandas doesn't.
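To make the role of subset concrete, here is a plain-Python sketch on a list of dict rows. It is not how Narwhals implements drop_nulls; the helper name is hypothetical, and None stands in for a null value.

```python
def drop_nulls(rows, subset=None):
    """Keep rows with no None in the `subset` columns (all columns if None)."""
    if isinstance(subset, str):
        subset = [subset]

    def is_complete(row):
        names = row.keys() if subset is None else subset
        return all(row[name] is not None for name in names)

    return [row for row in rows if is_complete(row)]


rows = [
    {"a": 1.0, "ba": 1.0},
    {"a": 2.0, "ba": None},
    {"a": None, "ba": 2.0},
]
all_columns = drop_nulls(rows)        # a null in any column drops the row
only_a = drop_nulls(rows, subset="a")  # only nulls in "a" drop the row
print(all_columns)
print(only_a)
```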
Examples:
>>> import polars as pl
>>> import pandas as pd
>>> import narwhals as nw
>>> data = {"a": [1.0, 2.0, None], "ba": [1.0, None, 2.0]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.drop_nulls()
We can then pass either pandas or Polars:
>>> func(df_pd)
a ba
0 1.0 1.0
>>> func(df_pl)
shape: (1, 2)
┌─────┬─────┐
│ a ┆ ba │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 1.0 ┆ 1.0 │
└─────┴─────┘
filter(*predicates)
Filter the rows in the DataFrame based on one or more predicate expressions.
The original order of the remaining rows is preserved.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*predicates | IntoExpr \| Iterable[IntoExpr] \| list[bool] | Expression(s) that evaluate to a boolean Series. Can also be a single boolean list. | () |
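The *args example for this method returns only the rows that satisfy every predicate, which is consistent with the predicates being combined with a logical AND. A plain-Python sketch of that reading follows; the helper filter_rows and the use of callables as predicates are illustrative assumptions, not the Narwhals API.

```python
def filter_rows(rows, *predicates):
    """Keep the rows for which every predicate returns True, in original order."""
    return [row for row in rows if all(pred(row) for pred in predicates)]


rows = [
    {"foo": 1, "bar": 6, "ham": "a"},
    {"foo": 2, "bar": 7, "ham": "b"},
    {"foo": 3, "bar": 8, "ham": "c"},
]
kept = filter_rows(
    rows,
    lambda row: row["foo"] <= 2,
    lambda row: row["ham"] not in ("b", "c"),
)
print(kept)
```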
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {
... "foo": [1, 2, 3],
... "bar": [6, 7, 8],
... "ham": ["a", "b", "c"],
... }
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
Let's define a dataframe-agnostic function in which we filter on one condition.
>>> @nw.narwhalify
... def func(df):
...     return df.filter(nw.col("foo") > 1)
We can then pass either pandas or Polars to func:
>>> func(df_pd)
foo bar ham
1 2 7 b
2 3 8 c
>>> func(df_pl)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 2 ┆ 7 ┆ b │
│ 3 ┆ 8 ┆ c │
└─────┴─────┴─────┘
Filter on multiple conditions, combined with and/or operators:
>>> @nw.narwhalify
... def func(df):
...     return df.filter((nw.col("foo") < 3) & (nw.col("ham") == "a"))
>>> func(df_pd)
foo bar ham
0 1 6 a
>>> func(df_pl)
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ a │
└─────┴─────┴─────┘
>>> @nw.narwhalify
... def func(df):
...     return df.filter((nw.col("foo") == 1) | (nw.col("ham") == "c"))
>>> func(df_pd)
foo bar ham
0 1 6 a
2 3 8 c
>>> func(df_pl)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ a │
│ 3 ┆ 8 ┆ c │
└─────┴─────┴─────┘
Provide multiple filters using *args syntax:
>>> @nw.narwhalify
... def func(df):
...     dframe = df.filter(
...         nw.col("foo") <= 2,
...         ~nw.col("ham").is_in(["b", "c"]),
...     )
...     return dframe
>>> func(df_pd)
foo bar ham
0 1 6 a
>>> func(df_pl)
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ a │
└─────┴─────┴─────┘
gather_every(n, offset=0)
Take every nth row in the DataFrame and return as a new DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | int | Gather every n-th row. | required |
offset | int | Starting index. | 0 |
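On a plain Python list, the same selection can be written as an extended slice. The sketch below only illustrates how n and offset interact; the helper gather_every is hypothetical and not part of Narwhals.

```python
def gather_every(rows, n, offset=0):
    """Take every n-th row, starting at position `offset`."""
    return rows[offset::n]


rows = list(zip([1, 2, 3, 4], [5, 6, 7, 8]))  # rows of columns "a" and "b"
picked = gather_every(rows, n=2, offset=1)
print(picked)
```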
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function in which we gather every 2nd row, starting from an offset of 1:
>>> @nw.narwhalify
... def func(df):
...     return df.gather_every(n=2, offset=1)
>>> func(df_pd)
a b
1 2 6
3 4 8
>>> func(df_pl)
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2 ┆ 6 │
│ 4 ┆ 8 │
└─────┴─────┘
get_column(name)
Get a single column by name.
Notes
Although name is typed as str, pandas does allow non-string column names, and they will work when passed to this function if the narwhals.DataFrame is backed by a pandas dataframe with non-string columns. This function can only be used to extract a column by name, so there is no risk of ambiguity.
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> from narwhals.typing import IntoDataFrame
>>> from narwhals.typing import IntoSeries
>>>
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
We define a library agnostic function:
>>> def agnostic_get_column(df_native: IntoDataFrame) -> IntoSeries:
...     df = nw.from_native(df_native)
...     name = df.columns[0]
...     return df.get_column(name).to_native()
We can then pass either pandas or Polars to agnostic_get_column:
>>> agnostic_get_column(df_pd)
0 1
1 2
Name: a, dtype: int64
>>> agnostic_get_column(df_pl)
shape: (2,)
Series: 'a' [i64]
[
1
2
]
group_by(*keys, drop_null_keys=False)
Start a group by operation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*keys | str \| Iterable[str] | Column(s) to group by. Accepts multiple column names as a list. | () |
drop_null_keys | bool | If True, then groups where any key is null won't be included in the result. | False |
Returns:
Name | Type | Description |
---|---|---|
GroupBy | GroupBy[Self] | Object which can be used to perform aggregations. |
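The effect of drop_null_keys can be sketched with a grouped sum in plain Python. This is an illustration under stated assumptions rather than Narwhals' implementation: the helper group_sum is hypothetical and None stands in for a null key.

```python
from collections import defaultdict


def group_sum(rows, keys, value, drop_null_keys=False):
    """Sum `value` per combination of `keys` over a list of dict rows."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[name] for name in keys)
        # Skip the whole group when any part of its key is null.
        if drop_null_keys and any(part is None for part in key):
            continue
        totals[key] += row[value]
    return dict(totals)


rows = [
    {"a": "a", "b": 1},
    {"a": None, "b": 2},
    {"a": "a", "b": 3},
]
kept = group_sum(rows, ["a"], "b")
dropped = group_sum(rows, ["a"], "b", drop_null_keys=True)
print(kept)
print(dropped)
```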
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {
... "a": ["a", "b", "a", "b", "c"],
... "b": [1, 2, 1, 3, 3],
... "c": [5, 4, 3, 2, 1],
... }
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
Let's define a dataframe-agnostic function in which we group by one column and call agg to compute the grouped sum of another column.
>>> @nw.narwhalify
... def func(df):
...     return df.group_by("a").agg(nw.col("b").sum()).sort("a")
We can then pass either pandas or Polars to func:
>>> func(df_pd)
a b
0 a 2
1 b 5
2 c 3
>>> func(df_pl)
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a ┆ 2 │
│ b ┆ 5 │
│ c ┆ 3 │
└─────┴─────┘
Group by multiple columns by passing a list of column names.
>>> @nw.narwhalify
... def func(df):
...     return df.group_by(["a", "b"]).agg(nw.max("c")).sort("a", "b")
>>> func(df_pd)
a b c
0 a 1 5
1 b 2 4
2 b 3 2
3 c 3 1
>>> func(df_pl)
shape: (4, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 │
│ b ┆ 2 ┆ 4 │
│ b ┆ 3 ┆ 2 │
│ c ┆ 3 ┆ 1 │
└─────┴─────┴─────┘
head(n=5)
Get the first n rows.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | int | Number of rows to return. If a negative value is passed, return all rows except the last | 5 |
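Python list slicing behaves in a similar way and may help with intuition for negative values. The sketch below assumes that a negative n drops the last abs(n) rows; that count is an assumption, since the description above does not state it, and the helper head is hypothetical.

```python
def head(rows, n=5):
    """First n rows; for negative n, everything except the last abs(n) rows."""
    return rows[:n]


rows = [1, 2, 3, 4, 5]
first_three = head(rows, 3)
all_but_last_two = head(rows, -2)
print(first_three, all_but_last_two)
```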
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {
... "foo": [1, 2, 3, 4, 5],
... "bar": [6, 7, 8, 9, 10],
... "ham": ["a", "b", "c", "d", "e"],
... }
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
Let's define a dataframe-agnostic function that gets the first 3 rows.
>>> @nw.narwhalify
... def func(df):
...     return df.head(3)
We can then pass either pandas or Polars to func:
>>> func(df_pd)
foo bar ham
0 1 6 a
1 2 7 b
2 3 8 c
>>> func(df_pl)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ a │
│ 2 ┆ 7 ┆ b │
│ 3 ┆ 8 ┆ c │
└─────┴─────┴─────┘
is_duplicated()
Get a mask of all duplicated rows in this DataFrame.
Returns:
Type | Description |
---|---|
Series | A new Series. |
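As a plain-Python sketch of what the mask means: a row is flagged when the same combination of values occurs more than once, so every copy is marked, including the first. The helper is_duplicated below is hypothetical and works on a list of tuples; it is not Narwhals' implementation. Negating the mask gives the rows that occur exactly once.

```python
from collections import Counter


def is_duplicated(rows):
    """Return True for each row whose values occur more than once."""
    counts = Counter(rows)
    return [counts[row] > 1 for row in rows]


rows = list(zip([1, 2, 3, 1], ["x", "y", "z", "x"]))
mask = is_duplicated(rows)
unique_mask = [not flag for flag in mask]
print(mask)
print(unique_mask)
```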
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> df_pd = pd.DataFrame(
... {
... "a": [1, 2, 3, 1],
... "b": ["x", "y", "z", "x"],
... }
... )
>>> df_pl = pl.DataFrame(
... {
... "a": [1, 2, 3, 1],
... "b": ["x", "y", "z", "x"],
... }
... )
Let's define a dataframe-agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.is_duplicated()
We can then pass either pandas or Polars to func:
>>> func(df_pd)
0 True
1 False
2 False
3 True
dtype: bool
>>> func(df_pl)
shape: (4,)
Series: '' [bool]
[
true
false
false
true
]
is_empty()
Check if the dataframe is empty.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
Let's define a dataframe-agnostic function that filters rows in which "foo" values are greater than 10, and then checks if the result is empty or not:
>>> @nw.narwhalify
... def func(df):
...     return df.filter(nw.col("foo") > 10).is_empty()
We can then pass either pandas or Polars to func:
>>> df_pd = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df_pl = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> func(df_pd), func(df_pl)
(True, True)
>>> df_pd = pd.DataFrame({"foo": [100, 2, 3], "bar": [4, 5, 6]})
>>> df_pl = pl.DataFrame({"foo": [100, 2, 3], "bar": [4, 5, 6]})
>>> func(df_pd), func(df_pl)
(False, False)
is_unique()
Get a mask of all unique rows in this DataFrame.
Returns:
Type | Description |
---|---|
Series | A new Series. |
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> df_pd = pd.DataFrame(
... {
... "a": [1, 2, 3, 1],
... "b": ["x", "y", "z", "x"],
... }
... )
>>> df_pl = pl.DataFrame(
... {
... "a": [1, 2, 3, 1],
... "b": ["x", "y", "z", "x"],
... }
... )
Let's define a dataframe-agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.is_unique()
We can then pass either pandas or Polars to func:
>>> func(df_pd)
0 False
1 True
2 True
3 False
dtype: bool
>>> func(df_pl)
shape: (4,)
Series: '' [bool]
[
false
true
true
false
]
item(row=None, column=None)
Return the DataFrame as a scalar, or return the element at the given row/column.
Notes
If row and column are not provided, this is equivalent to df[0, 0], with a check that the shape is (1, 1). With row and column, this is equivalent to df[row, column].
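The two modes described in the note can be sketched on a dict of lists. This is an illustration only: the helper item is hypothetical, the choice of ValueError is an assumption, and rejecting a call with only one of row/column is also an assumption.

```python
def item(columns, row=None, column=None):
    """Return a single value from a dict of column name -> list of values."""
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    if row is None and column is None:
        # Scalar mode: only valid for a frame holding exactly one value.
        if (n_rows, len(names)) != (1, 1):
            raise ValueError(f"expected shape (1, 1), got ({n_rows}, {len(names)})")
        return columns[names[0]][0]
    if row is None or column is None:
        raise ValueError("row and column must be given together")
    # Integers are positions, strings are column names.
    name = names[column] if isinstance(column, int) else column
    return columns[name][row]


data = {"a": [1, 2, 3], "b": [4, 5, 6]}
by_position = item(data, 1, 1)
by_name = item(data, 2, "b")
scalar = item({"a": [7]})
print(by_position, by_name, scalar)
```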
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {"a": [1, 2, 3], "b": [4, 5, 6]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function that returns the item at a given row/column:
>>> @nw.narwhalify
... def func(df, row, column):
...     return df.item(row, column)
We can then pass either pandas or Polars to func:
>>> func(df_pd, 1, 1), func(df_pd, 2, "b")
(np.int64(5), np.int64(6))
>>> func(df_pl, 1, 1), func(df_pl, 2, "b")
(5, 6)
iter_rows(*, named=False, buffer_size=512)
iter_rows(
*, named: Literal[False], buffer_size: int = ...
) -> Iterator[tuple[Any, ...]]
iter_rows(
*, named: Literal[True], buffer_size: int = ...
) -> Iterator[dict[str, Any]]
iter_rows(
*, named: bool, buffer_size: int = ...
) -> Iterator[tuple[Any, ...]] | Iterator[dict[str, Any]]
Returns an iterator over the rows of the DataFrame, as Python-native values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
named | bool | By default, each row is returned as a tuple of values given in the same order as the frame columns. Setting named=True will return rows of dictionaries instead. | False |
buffer_size | int | Determines the number of rows that are buffered internally while iterating over the data. See https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.iter_rows.html | 512 |
Notes
cuDF doesn't support this method.
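The difference between named=False and named=True can be sketched with a generator over a dict of lists. The helper iter_rows below is hypothetical and ignores buffer_size; it illustrates the shape of the output, not Narwhals' implementation.

```python
def iter_rows(columns, *, named=False):
    """Yield rows lazily, as tuples or as dicts keyed by column name."""
    names = list(columns)
    for values in zip(*columns.values()):
        yield dict(zip(names, values)) if named else values


data = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
as_tuples = list(iter_rows(data))
as_dicts = list(iter_rows(data, named=True))
print(as_tuples[0])
print(as_dicts[0])
```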
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df, *, named):
...     return df.iter_rows(named=named)
We can then pass either pandas or Polars to func:
>>> [row for row in func(df_pd, named=False)]
[(1, 6.0, 'a'), (2, 7.0, 'b'), (3, 8.0, 'c')]
>>> [row for row in func(df_pd, named=True)]
[{'foo': 1, 'bar': 6.0, 'ham': 'a'}, {'foo': 2, 'bar': 7.0, 'ham': 'b'}, {'foo': 3, 'bar': 8.0, 'ham': 'c'}]
>>> [row for row in func(df_pl, named=False)]
[(1, 6.0, 'a'), (2, 7.0, 'b'), (3, 8.0, 'c')]
>>> [row for row in func(df_pl, named=True)]
[{'foo': 1, 'bar': 6.0, 'ham': 'a'}, {'foo': 2, 'bar': 7.0, 'ham': 'b'}, {'foo': 3, 'bar': 8.0, 'ham': 'c'}]
join(other, on=None, how='inner', *, left_on=None, right_on=None, suffix='_right')
Join in SQL-like fashion.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | Self | DataFrame to join with. | required |
on | str \| list[str] \| None | Name(s) of the join columns in both DataFrames. If set, | None |
how | Literal['inner', 'left', 'cross', 'semi', 'anti'] | Join strategy. | 'inner' |
left_on | str \| list[str] \| None | Join column of the left DataFrame. | None |
right_on | str \| list[str] \| None | Join column of the right DataFrame. | None |
suffix | str | Suffix to append to columns with a duplicate name. | '_right' |
Returns:
Type | Description |
---|---|
Self | A new joined DataFrame |
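The examples for this method only show the default inner join. The following plain-Python sketch contrasts four of the strategies on a single key column, using the usual SQL-style meaning of each name. It is an illustration, not Narwhals' implementation: the helper join is hypothetical, 'cross' is left out, and name collisions (the suffix parameter) are not handled.

```python
def join(left, right, on, how="inner"):
    """Join two lists of dict rows on a single key column."""
    if how not in {"inner", "left", "semi", "anti"}:
        raise ValueError(f"unsupported strategy in this sketch: {how}")
    # Index the right rows by key so each left row can look up its matches.
    by_key = {}
    for row in right:
        by_key.setdefault(row[on], []).append(row)
    right_names = [name for name in (right[0] if right else {}) if name != on]

    out = []
    for row in left:
        matches = by_key.get(row[on], [])
        if how == "semi":
            if matches:  # keep left rows that have a match, left columns only
                out.append(dict(row))
        elif how == "anti":
            if not matches:  # keep left rows that have no match
                out.append(dict(row))
        elif matches:  # inner and left: one output row per matching right row
            for match in matches:
                out.append({**row, **{name: match[name] for name in right_names}})
        elif how == "left":  # unmatched left rows are kept, padded with nulls
            out.append({**row, **{name: None for name in right_names}})
    return out


left = [
    {"foo": 1, "ham": "a"},
    {"foo": 2, "ham": "b"},
    {"foo": 3, "ham": "c"},
]
right = [
    {"apple": "x", "ham": "a"},
    {"apple": "y", "ham": "b"},
    {"apple": "z", "ham": "d"},
]
results = {how: join(left, right, "ham", how) for how in ("inner", "left", "semi", "anti")}
for how, rows in results.items():
    print(how, rows)
```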
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {
... "foo": [1, 2, 3],
... "bar": [6.0, 7.0, 8.0],
... "ham": ["a", "b", "c"],
... }
>>> data_other = {
... "apple": ["x", "y", "z"],
... "ham": ["a", "b", "d"],
... }
>>> df_pd = pd.DataFrame(data)
>>> other_pd = pd.DataFrame(data_other)
>>> df_pl = pl.DataFrame(data)
>>> other_pl = pl.DataFrame(data_other)
Let's define a dataframe-agnostic function in which we join on the "ham" column:
>>> @nw.narwhalify
... def join_on_ham(df, other_any):
...     return df.join(other_any, left_on="ham", right_on="ham")
We can now pass either pandas or Polars to the function:
>>> join_on_ham(df_pd, other_pd)
foo bar ham apple
0 1 6.0 a x
1 2 7.0 b y
>>> join_on_ham(df_pl, other_pl)
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪═══════╡
│ 1 ┆ 6.0 ┆ a ┆ x │
│ 2 ┆ 7.0 ┆ b ┆ y │
└─────┴─────┴─────┴───────┘
join_asof(other, *, left_on=None, right_on=None, on=None, by_left=None, by_right=None, by=None, strategy='backward')
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof_join key.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
other | Self | DataFrame to join with. | required |
left_on | str \| None | Name(s) of the left join column(s). | None |
right_on | str \| None | Name(s) of the right join column(s). | None |
on | str \| None | Join column of both DataFrames. If set, left_on and right_on should be None. | None |
by_left | str \| list[str] \| None | Join on these columns before doing the asof join. | None |
by_right | str \| list[str] \| None | Join on these columns before doing the asof join. | None |
by | str \| list[str] \| None | Join on these columns before doing the asof join. | None |
strategy | Literal['backward', 'forward', 'nearest'] | Join strategy. The default is "backward". | 'backward' |
Returns:
Type | Description |
---|---|
Self | A new joined DataFrame |
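The three strategies can be illustrated with a small bisect-based sketch over sorted keys. It is a plain-Python model, not Narwhals' implementation: the helper asof_match is hypothetical, it returns the index of the matched right key (or None), and resolving a tie in "nearest" towards the earlier key is an assumption.

```python
from bisect import bisect_left, bisect_right
from datetime import date


def asof_match(left_keys, right_keys, strategy="backward"):
    """For each left key, return the index of the matched right key, or None."""
    matched = []
    for key in left_keys:
        back = bisect_right(right_keys, key) - 1  # last right key <= key
        fwd = bisect_left(right_keys, key)        # first right key >= key
        back = back if back >= 0 else None
        fwd = fwd if fwd < len(right_keys) else None
        if strategy == "backward":
            matched.append(back)
        elif strategy == "forward":
            matched.append(fwd)
        elif strategy == "nearest":
            candidates = [i for i in (back, fwd) if i is not None]
            nearest = min(candidates, key=lambda i: abs(right_keys[i] - key)) if candidates else None
            matched.append(nearest)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return matched


gdp_dates = [date(year, 1, 1) for year in range(2016, 2021)]
gdp = [4164, 4411, 4566, 4696, 4827]
population_dates = [date(2016, 3, 1), date(2018, 8, 1), date(2019, 1, 1)]

gdp_by_strategy = {
    strategy: [gdp[i] for i in asof_match(population_dates, gdp_dates, strategy)]
    for strategy in ("backward", "forward", "nearest")
}
print(gdp_by_strategy)
```

With strategy="backward", the sketch picks the same gdp values as the first example in the Examples section.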
Examples:
>>> from datetime import datetime
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data_gdp = {
... "datetime": [
... datetime(2016, 1, 1),
... datetime(2017, 1, 1),
... datetime(2018, 1, 1),
... datetime(2019, 1, 1),
... datetime(2020, 1, 1),
... ],
... "gdp": [4164, 4411, 4566, 4696, 4827],
... }
>>> data_population = {
... "datetime": [
... datetime(2016, 3, 1),
... datetime(2018, 8, 1),
... datetime(2019, 1, 1),
... ],
... "population": [82.19, 82.66, 83.12],
... }
>>> gdp_pd = pd.DataFrame(data_gdp)
>>> population_pd = pd.DataFrame(data_population)
>>> gdp_pl = pl.DataFrame(data_gdp).sort("datetime")
>>> population_pl = pl.DataFrame(data_population).sort("datetime")
Let's define a dataframe-agnostic function in which we join on the "datetime" column:
>>> @nw.narwhalify
... def join_asof_datetime(df, other_any, strategy):
...     return df.join_asof(other_any, on="datetime", strategy=strategy)
We can now pass either pandas or Polars to the function:
>>> join_asof_datetime(population_pd, gdp_pd, strategy="backward")
datetime population gdp
0 2016-03-01 82.19 4164
1 2018-08-01 82.66 4566
2 2019-01-01 83.12 4696
>>> join_asof_datetime(population_pl, gdp_pl, strategy="backward")
shape: (3, 3)
┌─────────────────────┬────────────┬──────┐
│ datetime ┆ population ┆ gdp │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ i64 │
╞═════════════════════╪════════════╪══════╡
│ 2016-03-01 00:00:00 ┆ 82.19 ┆ 4164 │
│ 2018-08-01 00:00:00 ┆ 82.66 ┆ 4566 │
│ 2019-01-01 00:00:00 ┆ 83.12 ┆ 4696 │
└─────────────────────┴────────────┴──────┘
Here is a real-world time-series example that uses the by argument.
>>> from datetime import datetime
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data_quotes = {
... "datetime": [
... datetime(2016, 5, 25, 13, 30, 0, 23),
... datetime(2016, 5, 25, 13, 30, 0, 23),
... datetime(2016, 5, 25, 13, 30, 0, 30),
... datetime(2016, 5, 25, 13, 30, 0, 41),
... datetime(2016, 5, 25, 13, 30, 0, 48),
... datetime(2016, 5, 25, 13, 30, 0, 49),
... datetime(2016, 5, 25, 13, 30, 0, 72),
... datetime(2016, 5, 25, 13, 30, 0, 75),
... ],
... "ticker": [
... "GOOG",
... "MSFT",
... "MSFT",
... "MSFT",
... "GOOG",
... "AAPL",
... "GOOG",
... "MSFT",
... ],
... "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
... "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
... }
>>> data_trades = {
... "datetime": [
... datetime(2016, 5, 25, 13, 30, 0, 23),
... datetime(2016, 5, 25, 13, 30, 0, 38),
... datetime(2016, 5, 25, 13, 30, 0, 48),
... datetime(2016, 5, 25, 13, 30, 0, 48),
... datetime(2016, 5, 25, 13, 30, 0, 48),
... ],
... "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
... "price": [51.95, 51.95, 720.77, 720.92, 98.0],
... "quantity": [75, 155, 100, 100, 100],
... }
>>> quotes_pd = pd.DataFrame(data_quotes)
>>> trades_pd = pd.DataFrame(data_trades)
>>> quotes_pl = pl.DataFrame(data_quotes).sort("datetime")
>>> trades_pl = pl.DataFrame(data_trades).sort("datetime")
Let's define a dataframe-agnostic function in which we join on the "datetime" column, matching by the "ticker" column:
>>> @nw.narwhalify
... def join_asof_datetime_by_ticker(df, other_any):
...     return df.join_asof(other_any, on="datetime", by="ticker")
We can now pass either pandas or Polars to the function:
>>> join_asof_datetime_by_ticker(trades_pd, quotes_pd)
datetime ticker price quantity bid ask
0 2016-05-25 13:30:00.000023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.000038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.000048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.000048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.000048 AAPL 98.00 100 NaN NaN
>>> join_asof_datetime_by_ticker(trades_pl, quotes_pl)
shape: (5, 6)
┌────────────────────────────┬────────┬────────┬──────────┬───────┬────────┐
│ datetime ┆ ticker ┆ price ┆ quantity ┆ bid ┆ ask │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 ┆ i64 ┆ f64 ┆ f64 │
╞════════════════════════════╪════════╪════════╪══════════╪═══════╪════════╡
│ 2016-05-25 13:30:00.000023 ┆ MSFT ┆ 51.95 ┆ 75 ┆ 51.95 ┆ 51.96 │
│ 2016-05-25 13:30:00.000038 ┆ MSFT ┆ 51.95 ┆ 155 ┆ 51.97 ┆ 51.98 │
│ 2016-05-25 13:30:00.000048 ┆ GOOG ┆ 720.77 ┆ 100 ┆ 720.5 ┆ 720.93 │
│ 2016-05-25 13:30:00.000048 ┆ GOOG ┆ 720.92 ┆ 100 ┆ 720.5 ┆ 720.93 │
│ 2016-05-25 13:30:00.000048 ┆ AAPL ┆ 98.0 ┆ 100 ┆ null ┆ null │
└────────────────────────────┴────────┴────────┴──────────┴───────┴────────┘
lazy()
Lazify the DataFrame (if possible).
If a library does not support lazy execution, then this is a no-op.
Returns:
Type | Description |
---|---|
LazyFrame[Any] | A new LazyFrame. |
Examples:
Construct pandas, Polars and PyArrow DataFrames:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> from narwhals.typing import IntoFrame
>>>
>>> df = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> def agnostic_lazy(df_native: IntoFrame) -> IntoFrame:
...     df = nw.from_native(df_native)
...     return df.lazy().to_native()
Note that the pandas and PyArrow dataframes stay eager, whereas the Polars DataFrame becomes a Polars LazyFrame:
>>> agnostic_lazy(df_pd)
foo bar ham
0 1 6.0 a
1 2 7.0 b
2 3 8.0 c
>>> agnostic_lazy(df_pl)
<LazyFrame ...>
>>> agnostic_lazy(df_pa)
pyarrow.Table
foo: int64
bar: double
ham: string
----
foo: [[1,2,3]]
bar: [[6,7,8]]
ham: [["a","b","c"]]
null_count()
Create a new DataFrame that shows the null counts per column.
Notes
pandas and Polars handle null values differently. Polars distinguishes between NaN and Null, whereas pandas doesn't.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> df_pd = pd.DataFrame(
... {
... "foo": [1, None, 3],
... "bar": [6, 7, None],
... "ham": ["a", "b", "c"],
... }
... )
>>> df_pl = pl.DataFrame(
... {
... "foo": [1, None, 3],
... "bar": [6, 7, None],
... "ham": ["a", "b", "c"],
... }
... )
Let's define a dataframe-agnostic function that returns the null count of each column:
>>> @nw.narwhalify
... def func(df):
...     return df.null_count()
We can then pass either pandas or Polars to func:
>>> func(df_pd)
foo bar ham
0 1 1 0
>>> func(df_pl)
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 0 │
└─────┴─────┴─────┘
pipe(function, *args, **kwargs)
Pipe function call.
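Conceptually, piping passes the object itself as the first argument of the function, followed by any extra arguments. A minimal stand-alone sketch of that pattern follows; the free function pipe and the helper keep_short_names are hypothetical names used only for illustration.

```python
def pipe(obj, function, *args, **kwargs):
    """Call `function` with `obj` as its first argument."""
    return function(obj, *args, **kwargs)


def keep_short_names(columns, max_len):
    """Keep the column names that are at most `max_len` characters long."""
    return [name for name in columns if len(name) <= max_len]


short = pipe(["a", "ba"], keep_short_names, max_len=1)
print(short)
```

The benefit is that a chain of transformations reads top to bottom instead of as nested function calls.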
Examples:
>>> import polars as pl
>>> import pandas as pd
>>> import narwhals as nw
>>> data = {"a": [1, 2, 3], "ba": [4, 5, 6]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.pipe(
...         lambda _df: _df.select([x for x in _df.columns if len(x) == 1])
...     )
We can then pass either pandas or Polars:
>>> func(df_pd)
a
0 1
1 2
2 3
>>> func(df_pl)
shape: (3, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
pivot(on, *, index=None, values=None, aggregate_function=None, maintain_order=True, sort_columns=False, separator='_')
Create a spreadsheet-style pivot table as a DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
on | str \| list[str] | Name of the column(s) whose values will be used as the header of the output DataFrame. | required |
index | str \| list[str] \| None | One or multiple keys to group by. If None, all remaining columns not specified on | None |
values | str \| list[str] \| None | One or multiple columns whose values fill the pivoted table. If None, all remaining columns not specified on | None |
aggregate_function | Literal['min', 'max', 'first', 'last', 'sum', 'mean', 'median', 'len'] \| None | Choose from: None (no aggregation takes place; an error is raised if a group contains multiple values), or a predefined aggregate function string, one of {'min', 'max', 'first', 'last', 'sum', 'mean', 'median', 'len'}. | None |
maintain_order | bool | Sort the grouped keys so that the output order is predictable. | True |
sort_columns | bool | Sort the transposed columns by name. Default is by order of discovery. | False |
separator | str | Used as separator/delimiter in generated column names in case of multiple | '_' |
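To show how on, index, the value columns and separator interact, here is a plain-Python sketch of a pivot with a sum aggregation. It is an illustration only, not Narwhals' implementation: the helper pivot_sum is hypothetical, it supports a single on column and a single index column, and the column order inside each output row may differ from a real backend.

```python
def pivot_sum(rows, on, index, values, separator="_"):
    """Pivot a list of dict rows, summing each value column per (index, on) pair."""
    table = {}  # index value -> output row
    for row in rows:
        out = table.setdefault(row[index], {index: row[index]})
        for value in values:
            # New column names combine the value column and the `on` value.
            name = f"{value}{separator}{row[on]}"
            out[name] = out.get(name, 0) + row[value]
    return list(table.values())


columns = {
    "ix": [1, 1, 2, 2, 1, 2],
    "col": ["a", "a", "a", "a", "b", "b"],
    "foo": [0, 1, 2, 2, 7, 1],
    "bar": [0, 2, 0, 0, 9, 4],
}
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]
pivoted = pivot_sum(rows, on="col", index="ix", values=["foo", "bar"])
for row in pivoted:
    print(row)
```

The sums agree with the output shown in the Examples section for aggregate_function="sum".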
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {
... "ix": [1, 1, 2, 2, 1, 2],
... "col": ["a", "a", "a", "a", "b", "b"],
... "foo": [0, 1, 2, 2, 7, 1],
... "bar": [0, 2, 0, 0, 9, 4],
... }
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.pivot("col", index="ix", aggregate_function="sum")
We can then pass either pandas or Polars to func:
>>> func(df_pd)
ix foo_a foo_b bar_a bar_b
0 1 1 7 2 9
1 2 4 1 0 4
>>> func(df_pl)
shape: (2, 5)
┌─────┬───────┬───────┬───────┬───────┐
│ ix ┆ foo_a ┆ foo_b ┆ bar_a ┆ bar_b │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 1 ┆ 7 ┆ 2 ┆ 9 │
│ 2 ┆ 4 ┆ 1 ┆ 0 ┆ 4 │
└─────┴───────┴───────┴───────┴───────┘
rename(mapping)
Rename column names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mapping | dict[str, str] | Key value pairs that map from old name to new name. | required |
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.rename({"foo": "apple"})
We can then pass either pandas or Polars to func:
>>> func(df_pd)
apple bar ham
0 1 6 a
1 2 7 b
2 3 8 c
>>> func(df_pl)
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 1 ┆ 6 ┆ a │
│ 2 ┆ 7 ┆ b │
│ 3 ┆ 8 ┆ c │
└───────┴─────┴─────┘
row(index)
Get values at given row.
Note
You should NEVER use this method to iterate over a DataFrame; if you require row-iteration you should strongly prefer use of iter_rows() instead.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int | Row number. | required |
Notes
cuDF doesn't support this method.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {"a": [1, 2, 3], "b": [4, 5, 6]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a library-agnostic function to get the second row.
>>> @nw.narwhalify
... def func(df):
...     return df.row(1)
We can then pass pandas / Polars / any other supported library:
>>> func(df_pd)
(2, 5)
>>> func(df_pl)
(2, 5)
rows(*, named=False)
rows(
*, named: Literal[False] = False
) -> list[tuple[Any, ...]]
rows(*, named: Literal[True]) -> list[dict[str, Any]]
rows(
*, named: bool
) -> list[tuple[Any, ...]] | list[dict[str, Any]]
Returns all data in the DataFrame as a list of rows of python-native values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
named | bool | By default, each row is returned as a tuple of values given in the same order as the frame columns. Setting named=True will return rows of dictionaries instead. | False |
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df, *, named):
...     return df.rows(named=named)
We can then pass either pandas or Polars to func:
>>> func(df_pd, named=False)
[(1, 6.0, 'a'), (2, 7.0, 'b'), (3, 8.0, 'c')]
>>> func(df_pd, named=True)
[{'foo': 1, 'bar': 6.0, 'ham': 'a'}, {'foo': 2, 'bar': 7.0, 'ham': 'b'}, {'foo': 3, 'bar': 8.0, 'ham': 'c'}]
>>> func(df_pl, named=False)
[(1, 6.0, 'a'), (2, 7.0, 'b'), (3, 8.0, 'c')]
>>> func(df_pl, named=True)
[{'foo': 1, 'bar': 6.0, 'ham': 'a'}, {'foo': 2, 'bar': 7.0, 'ham': 'b'}, {'foo': 3, 'bar': 8.0, 'ham': 'c'}]
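The two return shapes can be pictured with a small plain-Python model over a dictionary of columns. This is only a sketch of the documented behaviour; rows_model is a hypothetical name and the code is not taken from Narwhals.

```python
def rows_model(columns, *, named=False):
    """Hypothetical model of rows(named=...) over a dict of equal-length columns."""
    names = list(columns)
    # zip(*values) walks the columns in lockstep, producing one tuple per row.
    tuples = list(zip(*columns.values()))
    if named:
        # Pair each value with its column name to get one dict per row.
        return [dict(zip(names, values)) for values in tuples]
    return tuples


data = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
print(rows_model(data))
# [(1, 6.0, 'a'), (2, 7.0, 'b'), (3, 8.0, 'c')]
print(rows_model(data, named=True)[0])
# {'foo': 1, 'bar': 6.0, 'ham': 'a'}
```

Tuples are cheaper to build, while dictionaries are convenient when the code that consumes the rows looks values up by column name.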
sample(n=None, *, fraction=None, with_replacement=False, seed=None)
Sample from this DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | int \| None | Number of items to return. Cannot be used with fraction. | None |
fraction | float \| None | Fraction of items to return. Cannot be used with n. | None |
with_replacement | bool | Allow values to be sampled more than once. | False |
seed | int \| None | Seed for the random number generator. If set to None (default), a random seed is generated for each sample operation. | None |
Notes
The results may not be consistent across libraries.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {"a": [1, 2, 3, 4], "b": ["x", "y", "x", "y"]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.sample(n=2, seed=123)
We can then pass either pandas or Polars to func:
>>> func(df_pd)
a b
3 4 y
0 1 x
>>> func(df_pl)
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 2 ┆ y │
│ 3 ┆ x │
└─────┴─────┘
As you can see, by using the same seed, the result will be consistent within the same backend, but not necessarily across different backends.
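The interplay of n, fraction, with_replacement and seed can be sketched with the standard library alone. This is a hypothetical model (sample_model is not a Narwhals function). The choice of ValueError, the rounding of fraction, and the default of one row are assumptions made for illustration, not documented behaviour.

```python
import random


def sample_model(rows, n=None, *, fraction=None, with_replacement=False, seed=None):
    """Hypothetical model of sample(); error type and rounding are assumptions."""
    if n is not None and fraction is not None:
        # Assumption: the real method may report this conflict differently.
        raise ValueError("pass either n or fraction, not both")
    if n is None:
        # Assumption: one row by default, and fraction is rounded down.
        n = 1 if fraction is None else int(len(rows) * fraction)
    # A dedicated generator makes the draw depend only on the seed.
    rng = random.Random(seed)
    if with_replacement:
        # choices() may return the same row more than once.
        return rng.choices(rows, k=n)
    # sample() never repeats a row, so n cannot exceed len(rows).
    return rng.sample(rows, k=n)


rows = [(1, "x"), (2, "y"), (3, "x"), (4, "y")]
first = sample_model(rows, n=2, seed=123)
second = sample_model(rows, n=2, seed=123)
print(first == second)  # True: same seed, same generator, same draw
print(len(sample_model(rows, n=10, with_replacement=True, seed=0)))  # 10
```

Each backend uses its own random number generator, which is why the same seed reproduces a result within one backend but not across backends.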
select(*exprs, **named_exprs)
Select columns from this DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*exprs | IntoExpr \| Iterable[IntoExpr] | Column(s) to select, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. | () |
**named_exprs | IntoExpr | Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used. | {} |
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {
... "foo": [1, 2, 3],
... "bar": [6, 7, 8],
... "ham": ["a", "b", "c"],
... }
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
Let's define a dataframe-agnostic function in which we pass the name of a column to select that column.
>>> @nw.narwhalify
... def func(df):
...     return df.select("foo")
We can then pass either pandas or Polars to func:
>>> func(df_pd)
foo
0 1
1 2
2 3
>>> func(df_pl)
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
Multiple columns can be selected by passing a list of column names.
>>> @nw.narwhalify
... def func(df):
...     return df.select(["foo", "bar"])
>>> func(df_pd)
foo bar
0 1 6
1 2 7
2 3 8
>>> func(df_pl)
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 6 │
│ 2 ┆ 7 │
│ 3 ┆ 8 │
└─────┴─────┘
Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.
>>> @nw.narwhalify
... def func(df):
...     return df.select(nw.col("foo"), nw.col("bar") + 1)
>>> func(df_pd)
foo bar
0 1 7
1 2 8
2 3 9
>>> func(df_pl)
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 7 │
│ 2 ┆ 8 │
│ 3 ┆ 9 │
└─────┴─────┘
Use keyword arguments to easily name your expression inputs.
>>> @nw.narwhalify
... def func(df):
...     return df.select(threshold=nw.col("foo") * 2)
>>> func(df_pd)
threshold
0 2
1 4
2 6
>>> func(df_pl)
shape: (3, 1)
┌───────────┐
│ threshold │
│ --- │
│ i64 │
╞═══════════╡
│ 2 │
│ 4 │
│ 6 │
└───────────┘
sort(by, *more_by, descending=False, nulls_last=False)
Sort the dataframe by the given columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
by | str \| Iterable[str] | Column name(s) to sort by. | required |
*more_by | str | Additional columns to sort by, specified as positional arguments. | () |
descending | bool \| Sequence[bool] | Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans. | False |
nulls_last | bool | Place null values last. | False |
Warning
Unlike Polars, it is not possible to specify a sequence of booleans for nulls_last in order to control per-column behaviour. Instead a single boolean is applied for all by columns.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {
... "a": [1, 2, None],
... "b": [6.0, 5.0, 4.0],
... "c": ["a", "c", "b"],
... }
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function in which we sort by multiple columns in different orders:
>>> @nw.narwhalify
... def func(df):
...     return df.sort("c", "a", descending=[False, True])
We can then pass either pandas or Polars to func:
>>> func(df_pd)
a b c
0 1.0 6.0 a
2 NaN 4.0 b
1 2.0 5.0 c
>>> func(df_pl)
shape: (3, 3)
┌──────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 1 ┆ 6.0 ┆ a │
│ null ┆ 4.0 ┆ b │
│ 2 ┆ 5.0 ┆ c │
└──────┴─────┴─────┘
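How a per-column descending flag combines with a single nulls_last flag can be sketched in plain Python. The function below is a hypothetical model (sort_model is not part of Narwhals) that operates on a list of row dictionaries. It assumes null placement is independent of the sort direction.

```python
def sort_model(rows, by, *, descending=False, nulls_last=False):
    """Hypothetical model of sort() over a list of row dicts."""
    if isinstance(descending, bool):
        flags = [descending] * len(by)
    else:
        flags = list(descending)
    out = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a correct multi-column sort.
    for col, desc in reversed(list(zip(by, flags))):
        nulls = [r for r in out if r[col] is None]
        rest = sorted(
            (r for r in out if r[col] is not None),
            key=lambda r: r[col],
            reverse=desc,
        )
        # One nulls_last flag is applied to every column in `by`.
        out = rest + nulls if nulls_last else nulls + rest
    return out


rows = [
    {"a": 1, "b": 6.0, "c": "a"},
    {"a": 2, "b": 5.0, "c": "c"},
    {"a": None, "b": 4.0, "c": "b"},
]
result = sort_model(rows, ["c", "a"], descending=[False, True])
print([r["a"] for r in result])  # [1, None, 2]
print([r["a"] for r in sort_model(rows, ["a"], nulls_last=True)])  # [1, 2, None]
```

The first call reproduces the row order of the example above: column c decides the order, and column a would only break ties.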
tail(n=5)
Get the last n rows.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n | int | Number of rows to return. If a negative value is passed, return all rows except the first abs(n). | 5 |
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {
... "foo": [1, 2, 3, 4, 5],
... "bar": [6, 7, 8, 9, 10],
... "ham": ["a", "b", "c", "d", "e"],
... }
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
Let's define a dataframe-agnostic function that gets the last 3 rows.
>>> @nw.narwhalify
... def func(df):
...     return df.tail(3)
We can then pass either pandas or Polars to func:
>>> func(df_pd)
foo bar ham
2 3 8 c
3 4 9 d
4 5 10 e
>>> func(df_pl)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 3 ┆ 8 ┆ c │
│ 4 ┆ 9 ┆ d │
│ 5 ┆ 10 ┆ e │
└─────┴─────┴─────┘
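The handling of a negative n is easiest to see with list slicing. The sketch below is a hypothetical model (tail_model is not a Narwhals function) and assumes that a negative n drops the first abs(n) rows, as in Polars.

```python
def tail_model(rows, n=5):
    """Hypothetical model of tail(n) using list slicing."""
    if n < 0:
        # Negative n: keep everything except the first abs(n) rows.
        return rows[abs(n):]
    # Non-negative n: keep the last n rows (all rows if n exceeds the length).
    return rows[max(len(rows) - n, 0):]


foo = [1, 2, 3, 4, 5]
print(tail_model(foo, 3))   # [3, 4, 5]
print(tail_model(foo, -2))  # [3, 4, 5]
print(tail_model(foo, 10))  # [1, 2, 3, 4, 5]
```

With five rows, tail(3) and tail(-2) select the same rows: the first keeps the last three, the second drops the first two.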
to_arrow()
Convert to arrow table.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> data = {"foo": [1, 2, 3], "bar": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function that converts to arrow table:
>>> @nw.narwhalify
... def func(df):
...     return df.to_arrow()
>>> func(df_pd)
pyarrow.Table
foo: int64
bar: string
----
foo: [[1,2,3]]
bar: [["a","b","c"]]
>>> func(df_pl)
pyarrow.Table
foo: int64
bar: large_string
----
foo: [[1,2,3]]
bar: [["a","b","c"]]
to_dict(*, as_series=True)
to_dict(
*, as_series: Literal[True] = ...
) -> dict[str, Series]
to_dict(
*, as_series: Literal[False]
) -> dict[str, list[Any]]
to_dict(
*, as_series: bool
) -> dict[str, Series] | dict[str, list[Any]]
Convert DataFrame to a dictionary mapping column name to values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
as_series | bool | If set to True, the values are Narwhals Series; otherwise the values are lists of Python values. | True |
Returns:
Type | Description |
---|---|
dict[str, Series] \| dict[str, list[Any]] | A mapping from column name to values / Series. |
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> from narwhals.typing import IntoDataFrame
>>>
>>> df = {
... "A": [1, 2, 3, 4, 5],
... "fruits": ["banana", "banana", "apple", "apple", "banana"],
... "B": [5, 4, 3, 2, 1],
... "animals": ["beetle", "fly", "beetle", "beetle", "beetle"],
... "optional": [28, 300, None, 2, -30],
... }
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> def agnostic_to_dict(
...     df_native: IntoDataFrame,
... ) -> dict[str, list[int | str | float | None]]:
...     df = nw.from_native(df_native)
...     return df.to_dict(as_series=False)
We can then pass either pandas, Polars or PyArrow to agnostic_to_dict:
>>> agnostic_to_dict(df_pd)
{'A': [1, 2, 3, 4, 5], 'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'], 'B': [5, 4, 3, 2, 1], 'animals': ['beetle', 'fly', 'beetle', 'beetle', 'beetle'], 'optional': [28.0, 300.0, nan, 2.0, -30.0]}
>>> agnostic_to_dict(df_pl)
{'A': [1, 2, 3, 4, 5], 'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'], 'B': [5, 4, 3, 2, 1], 'animals': ['beetle', 'fly', 'beetle', 'beetle', 'beetle'], 'optional': [28, 300, None, 2, -30]}
>>> agnostic_to_dict(df_pa)
{'A': [1, 2, 3, 4, 5], 'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'], 'B': [5, 4, 3, 2, 1], 'animals': ['beetle', 'fly', 'beetle', 'beetle', 'beetle'], 'optional': [28, 300, None, 2, -30]}
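In the outputs above, the missing value in the optional column comes back as nan from pandas and as None from Polars and PyArrow. If downstream code needs a single representation, one option is to normalise the dictionary after the fact. The helper below is a hypothetical post-processing sketch using only the standard library; it is not part of Narwhals.

```python
import math


def nan_to_none(columns):
    """Hypothetical helper: replace float NaN with None in a dict of lists."""
    return {
        name: [
            # Only floats can be NaN, so check the type before calling isnan.
            None if isinstance(value, float) and math.isnan(value) else value
            for value in values
        ]
        for name, values in columns.items()
    }


pandas_like = {"optional": [28.0, 300.0, float("nan"), 2.0, -30.0]}
polars_like = {"optional": [28, 300, None, 2, -30]}
print(nan_to_none(pandas_like) == polars_like)  # True
```

The comparison succeeds because Python treats 28.0 and 28 as equal. The pandas column still holds floats, so code that depends on the exact type would need an explicit cast.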
to_native()
Convert Narwhals DataFrame to native one.
Returns:
Type | Description |
---|---|
DataFrameT | Object of class that user started with. |
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> data = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
>>> df_pa = pa.table(data)
Calling to_native on a Narwhals DataFrame returns the native object:
>>> nw.from_native(df_pd).to_native()
foo bar ham
0 1 6.0 a
1 2 7.0 b
2 3 8.0 c
>>> nw.from_native(df_pl).to_native()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6.0 ┆ a │
│ 2 ┆ 7.0 ┆ b │
│ 3 ┆ 8.0 ┆ c │
└─────┴─────┴─────┘
>>> nw.from_native(df_pa).to_native()
pyarrow.Table
foo: int64
bar: double
ham: string
----
foo: [[1,2,3]]
bar: [[6,7,8]]
ham: [["a","b","c"]]
to_numpy()
Convert this DataFrame to a NumPy ndarray.
Examples:
Construct pandas, Polars and PyArrow DataFrames:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> import numpy as np
>>> from narwhals.typing import IntoDataFrame
>>>
>>> df = {"foo": [1, 2, 3], "bar": [6.5, 7.0, 8.5], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> def agnostic_to_numpy(df_native: IntoDataFrame) -> np.ndarray:
...     df = nw.from_native(df_native)
...     return df.to_numpy()
We can then pass either pandas, Polars or PyArrow to agnostic_to_numpy:
>>> agnostic_to_numpy(df_pd)
array([[1, 6.5, 'a'],
[2, 7.0, 'b'],
[3, 8.5, 'c']], dtype=object)
>>> agnostic_to_numpy(df_pl)
array([[1, 6.5, 'a'],
[2, 7.0, 'b'],
[3, 8.5, 'c']], dtype=object)
>>> agnostic_to_numpy(df_pa)
array([[1, 6.5, 'a'],
[2, 7.0, 'b'],
[3, 8.5, 'c']], dtype=object)
to_pandas()
Convert this DataFrame to a pandas DataFrame.
Examples:
Construct pandas, Polars (eager) and PyArrow DataFrames:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> from narwhals.typing import IntoDataFrame
>>>
>>> df = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> def agnostic_to_pandas(df_native: IntoDataFrame) -> pd.DataFrame:
...     df = nw.from_native(df_native)
...     return df.to_pandas()
We can then pass any supported library such as pandas, Polars (eager), or PyArrow to agnostic_to_pandas:
>>> agnostic_to_pandas(df_pd)
foo bar ham
0 1 6.0 a
1 2 7.0 b
2 3 8.0 c
>>> agnostic_to_pandas(df_pl)
foo bar ham
0 1 6.0 a
1 2 7.0 b
2 3 8.0 c
>>> agnostic_to_pandas(df_pa)
foo bar ham
0 1 6.0 a
1 2 7.0 b
2 3 8.0 c
unique(subset=None, *, keep='any', maintain_order=False)
Drop duplicate rows from this dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
subset | str \| list[str] | None | Column name(s) to consider when identifying duplicate rows. | None |
keep | Literal['any', 'first', 'last', 'none'] | Which of the duplicate rows to keep: one of 'first', 'last', 'any' or 'none'. | 'any' |
maintain_order | bool | Keep the same order as the original DataFrame. This may be more expensive to compute. Setting this to | False |
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> data = {
... "foo": [1, 2, 3, 1],
... "bar": ["a", "a", "a", "a"],
... "ham": ["b", "b", "b", "b"],
... }
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.unique(["bar", "ham"])
We can then pass either pandas or Polars to func:
>>> func(df_pd)
foo bar ham
0 1 a b
>>> func(df_pl)
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ a ┆ b │
└─────┴─────┴─────┘
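The keep options are easier to compare side by side on data where the duplicates differ in another column. The following plain-Python sketch is a hypothetical model (unique_model is not a Narwhals function). It assumes the Polars meaning of each option: 'first' and 'last' keep that occurrence, 'none' drops every row that has a duplicate, and 'any' gives no guarantee, so the model simply keeps the first occurrence.

```python
from collections import Counter


def unique_model(rows, subset, *, keep="any"):
    """Hypothetical model of unique(subset, keep=...) over a list of row dicts."""

    def key(row):
        # Rows are duplicates when they agree on every column in `subset`.
        return tuple(row[col] for col in subset)

    if keep == "none":
        counts = Counter(key(row) for row in rows)
        return [row for row in rows if counts[key(row)] == 1]

    picked = {}
    for row in rows:
        k = key(row)
        # 'last' overwrites earlier matches; 'first' and 'any' keep the first.
        if keep == "last" or k not in picked:
            picked[k] = row
    return list(picked.values())


rows = [
    {"k": "a", "v": 1},
    {"k": "a", "v": 2},
    {"k": "b", "v": 3},
]
print([r["v"] for r in unique_model(rows, ["k"], keep="first")])  # [1, 3]
print([r["v"] for r in unique_model(rows, ["k"], keep="last")])   # [2, 3]
print([r["v"] for r in unique_model(rows, ["k"], keep="none")])   # [3]
```

In the documented example every row shares the same bar and ham values, so a single row remains whichever of 'any', 'first' or 'last' is used.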
unpivot(on=None, *, index=None, variable_name=None, value_name=None)
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (index) while all other columns, considered measured variables (on), are "unpivoted" to the row axis leaving just two non-identifier columns, 'variable' and 'value'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
on | str \| list[str] | None | Column(s) to use as values variables; if | None |
index | str \| list[str] | None | Column(s) to use as identifier variables. | None |
variable_name | str \| None | Name to give to the variable column. | None |
value_name | str \| None | Name to give to the value column. | None |
Notes
If you're coming from pandas, this is similar to pandas.DataFrame.melt, but with index replacing id_vars and on replacing value_vars.
In other frameworks, you might know this operation as pivot_longer.
Examples:
>>> import narwhals as nw
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> data = {
... "a": ["x", "y", "z"],
... "b": [1, 3, 5],
... "c": [2, 4, 6],
... }
We define a library agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.unpivot(on=["b", "c"], index="a")
We can pass any supported library such as pandas, Polars or PyArrow to func:
>>> func(pl.DataFrame(data))
shape: (6, 3)
┌─────┬──────────┬───────┐
│ a ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪══════════╪═══════╡
│ x ┆ b ┆ 1 │
│ y ┆ b ┆ 3 │
│ z ┆ b ┆ 5 │
│ x ┆ c ┆ 2 │
│ y ┆ c ┆ 4 │
│ z ┆ c ┆ 6 │
└─────┴──────────┴───────┘
>>> func(pd.DataFrame(data))
a variable value
0 x b 1
1 y b 3
2 z b 5
3 x c 2
4 y c 4
5 z c 6
>>> func(pa.table(data))
pyarrow.Table
a: string
variable: string
value: int64
----
a: [["x","y","z"],["x","y","z"]]
variable: [["b","b","b"],["c","c","c"]]
value: [[1,3,5],[2,4,6]]
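The wide-to-long reshaping can also be written out by hand, which shows where each output row comes from. The function below is a hypothetical plain-Python model for a single index column (unpivot_model is not a Narwhals function); the default output names follow the 'variable' and 'value' columns seen above.

```python
def unpivot_model(data, on, index, variable_name="variable", value_name="value"):
    """Hypothetical model of unpivot(on=..., index=...) for one index column."""
    out = {index: [], variable_name: [], value_name: []}
    for col in on:
        # Each measured column contributes one full block of rows:
        # the identifiers repeat, the column name labels the block.
        out[index].extend(data[index])
        out[variable_name].extend([col] * len(data[col]))
        out[value_name].extend(data[col])
    return out


data = {
    "a": ["x", "y", "z"],
    "b": [1, 3, 5],
    "c": [2, 4, 6],
}
print(unpivot_model(data, on=["b", "c"], index="a"))
# {'a': ['x', 'y', 'z', 'x', 'y', 'z'],
#  'variable': ['b', 'b', 'b', 'c', 'c', 'c'],
#  'value': [1, 3, 5, 2, 4, 6]}
```

The result has len(on) times as many rows as the input, which matches the shape (6, 3) reported by Polars above.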
with_columns(*exprs, **named_exprs)
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*exprs | IntoExpr \| Iterable[IntoExpr] | Column(s) to add, specified as positional arguments. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals. | () |
**named_exprs | IntoExpr | Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used. | {} |
Returns:
Name | Type | Description |
---|---|---|
DataFrame | Self | A new DataFrame with the columns added. |
Note
Creating a new DataFrame using this method does not create a new copy of existing data.
Examples:
>>> import pandas as pd
>>> import polars as pl
>>> import narwhals as nw
>>> df = {
... "a": [1, 2, 3, 4],
... "b": [0.5, 4, 10, 13],
... "c": [True, True, False, True],
... }
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
Let's define a dataframe-agnostic function in which we pass an expression to add it as a new column:
>>> @nw.narwhalify
... def func(df):
...     return df.with_columns((nw.col("a") * 2).alias("a*2"))
We can then pass either pandas or Polars to func:
>>> func(df_pd)
a b c a*2
0 1 0.5 True 2
1 2 4.0 True 4
2 3 10.0 False 6
3 4 13.0 True 8
>>> func(df_pl)
shape: (4, 4)
┌─────┬──────┬───────┬─────┐
│ a ┆ b ┆ c ┆ a*2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ i64 │
╞═════╪══════╪═══════╪═════╡
│ 1 ┆ 0.5 ┆ true ┆ 2 │
│ 2 ┆ 4.0 ┆ true ┆ 4 │
│ 3 ┆ 10.0 ┆ false ┆ 6 │
│ 4 ┆ 13.0 ┆ true ┆ 8 │
└─────┴──────┴───────┴─────┘
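Two points from the description, that a column with an existing name is replaced and that existing data is not copied, can be pictured with a dictionary of columns. This is a conceptual sketch only: with_columns_model is a hypothetical name, and how much data each backend actually shares is backend-specific.

```python
def with_columns_model(columns, **new_columns):
    """Hypothetical model of with_columns over a dict of column lists."""
    # Unpacking keeps existing names in their position, overwrites the ones
    # that are redefined, and appends genuinely new names at the end.
    return {**columns, **new_columns}


data = {"a": [1, 2, 3, 4], "b": [0.5, 4, 10, 13]}
out = with_columns_model(
    data,
    a=[value * 2 for value in data["a"]],     # replaces the existing "a"
    total=[x + y for x, y in zip(data["a"], data["b"])],  # new column
)
print(list(out))             # ['a', 'b', 'total']
print(out["a"])              # [2, 4, 6, 8]
print(out["b"] is data["b"])  # True: the untouched column is shared, not copied
print(data["a"])             # [1, 2, 3, 4]: the input is left unchanged
```

The input dictionary is never modified, which mirrors the fact that with_columns returns a new DataFrame rather than changing the one it is called on.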
with_row_index(name='index')
Insert column which enumerates rows.
Examples:
Construct pandas and Polars DataFrames:
>>> import polars as pl
>>> import pandas as pd
>>> import narwhals as nw
>>> data = {"a": [1, 2, 3], "b": [4, 5, 6]}
>>> df_pd = pd.DataFrame(data)
>>> df_pl = pl.DataFrame(data)
Let's define a dataframe-agnostic function:
>>> @nw.narwhalify
... def func(df):
...     return df.with_row_index()
We can then pass either pandas or Polars:
>>> func(df_pd)
index a b
0 0 1 4
1 1 2 5
2 2 3 6
>>> func(df_pl)
shape: (3, 3)
┌───────┬─────┬─────┐
│ index ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═══════╪═════╪═════╡
│ 0 ┆ 1 ┆ 4 │
│ 1 ┆ 2 ┆ 5 │
│ 2 ┆ 3 ┆ 6 │
└───────┴─────┴─────┘
write_csv(file=None)
Write dataframe to comma-separated values (CSV) file.
Examples:
Construct pandas, Polars and PyArrow DataFrames:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> df = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> def func(df):
...     df = nw.from_native(df)
...     return df.write_csv()
We can pass any supported library such as pandas, Polars or PyArrow to func:
>>> func(df_pd)
'foo,bar,ham\n1,6.0,a\n2,7.0,b\n3,8.0,c\n'
>>> func(df_pl)
'foo,bar,ham\n1,6.0,a\n2,7.0,b\n3,8.0,c\n'
>>> func(df_pa)
'"foo","bar","ham"\n1,6,"a"\n2,7,"b"\n3,8,"c"\n'
If we had passed a file name to write_csv, it would have been written to that file.
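The CSV strings above differ between backends: the PyArrow output quotes strings and writes 6 rather than 6.0. Comparing the raw text is therefore brittle. One way around this, sketched here with the standard csv module, is to parse the text before comparing; parse_csv_text is a hypothetical helper and not part of Narwhals.

```python
import csv
import io


def parse_csv_text(text):
    """Hypothetical helper: parse CSV text into a list of row dicts."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # csv.reader strips the optional quoting, so both dialects parse alike.
    return [dict(zip(header, row)) for row in reader]


# The two strings are copied from the outputs shown above.
pandas_like = "foo,bar,ham\n1,6.0,a\n2,7.0,b\n3,8.0,c\n"
pyarrow_like = '"foo","bar","ham"\n1,6,"a"\n2,7,"b"\n3,8,"c"\n'

a = parse_csv_text(pandas_like)
b = parse_csv_text(pyarrow_like)
print([row["ham"] for row in a] == [row["ham"] for row in b])  # True
# Numeric fields still differ as text ('6.0' vs '6'), so compare them as numbers.
print([float(row["bar"]) for row in a] == [float(row["bar"]) for row in b])  # True
```

Parsing first keeps tests that compare CSV output from depending on one backend's quoting and number formatting.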
write_parquet(file)
Write dataframe to parquet file.
Examples:
Construct pandas, Polars and PyArrow DataFrames:
>>> import pandas as pd
>>> import polars as pl
>>> import pyarrow as pa
>>> import narwhals as nw
>>> df = {"foo": [1, 2, 3], "bar": [6.0, 7.0, 8.0], "ham": ["a", "b", "c"]}
>>> df_pd = pd.DataFrame(df)
>>> df_pl = pl.DataFrame(df)
>>> df_pa = pa.table(df)
We define a library agnostic function:
>>> def func(df):
...     df = nw.from_native(df)
...     df.write_parquet("foo.parquet")
We can then pass either pandas, Polars or PyArrow to func:
>>> func(df_pd)
>>> func(df_pl)
>>> func(df_pa)