
df.pivot() gives incorrect results with a custom aggregation function #22479


Open
edschofield opened this issue Apr 29, 2025 · 6 comments · May be fixed by #22718
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@edschofield

edschofield commented Apr 29, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Consider this toy dataset:

import polars as pl

cars = pl.DataFrame([
    {'year': 1970, 'origin': 'EU', 'name': 'volkswagen 1131 deluxe sedan', 'weight': 1835},
    {'year': 1970, 'origin': 'EU', 'name': 'peugeot 504', 'weight': 2672},
    {'year': 1970, 'origin': 'JP', 'name': 'toyota corona mark ii', 'weight': 2372},
    {'year': 1970, 'origin': 'JP', 'name': 'datsun pl510', 'weight': 2130},
    {'year': 1971, 'origin': 'EU', 'name': 'opel 1900', 'weight': 2123},
    {'year': 1971, 'origin': 'EU', 'name': 'peugeot 304', 'weight': 2074},
    {'year': 1971, 'origin': 'JP', 'name': 'datsun pl510', 'weight': 2130},
    {'year': 1971, 'origin': 'JP', 'name': 'toyota corona', 'weight': 2228}
])

Suppose we want a pivot table containing the heaviest car for each year and origin.

This code does so by using .group_by() with a custom aggregation expression, then a simple .pivot():

heaviest_car_expr = pl.col('name').top_k_by(k=1, by='weight').first().alias('heaviest car')

heaviest_cars = cars.group_by(
    'year', 'origin'
).agg(
    heaviest_car_expr
).sort('year', 'origin').pivot(
    on='origin', index='year'
)

Correct result

year  EU           JP
1970  peugeot 504  toyota corona mark ii
1971  opel 1900    toyota corona

Question

Can we now go directly to this result, without using .group_by(), by passing a custom aggregation expression to .pivot()?

The docs for pivot() imply it's possible to pass a custom aggregation expression using pl.element().
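
For context, a self-contained reduction over pl.element() (one that does not reference any other column) is the kind of usage the docs illustrate. A minimal sketch, using .max() purely as an illustration:

# Illustration only: a reduction that touches just the values column
# appears to behave as documented (it picks the alphabetically last
# name per cell here, not the heaviest car).
simple_pivot = cars.pivot(
    on='origin', index='year', values='name',
    aggregate_function=pl.element().max(), sort_columns=True
)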

Case 1

This code doesn't raise an exception but instead produces an incorrect result:

incorrect_heaviest_cars_1 = cars.pivot(
    on='origin', index='year', values='name',
    aggregate_function=pl.element().top_k_by(by='weight', k=1).first(),
    sort_columns=True
)

Incorrect result

year  EU                            JP
1970  volkswagen 1131 deluxe sedan  toyota corona mark ii
1971  peugeot 304                   toyota corona

Case 2

The "obvious" solution also doesn't raise an exception and instead produces the same incorrect result:

incorrect_heaviest_cars_2 = cars.pivot(
    on='origin', index='year', values='name', aggregate_function=heaviest_car_expr, sort_columns=True
)

Case 3

Note that the by= argument to .top_k_by() is being completely ignored:

broken_heaviest_car_expr = pl.element().top_k_by(by='THIS COLUMN DOES NOT EXIST', k=1).first()
incorrect_heaviest_cars_3 = cars.pivot(
    on='origin', index='year', values='name', aggregate_function=broken_heaviest_car_expr, sort_columns=True
)

The column passed as by=... is ignored when the custom expression is passed as an aggregate_function to .pivot(), even if the column is nonexistent.

(Note that, unlike .pivot(), .group_by() raises a ColumnNotFoundError in these cases as expected.)
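
For comparison, a minimal sketch of the same nonexistent-column expression under .group_by(), which fails loudly:

# Unlike .pivot(), .group_by() resolves the by= column and raises
# polars.exceptions.ColumnNotFoundError here:
cars.group_by('year', 'origin').agg(
    pl.col('name').top_k_by(by='THIS COLUMN DOES NOT EXIST', k=1).first()
)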

At least these expression methods are affected when used with .pivot():

  • .top_k_by()
  • .bottom_k_by()
  • .sort_by()

Issue description

If this kind of aggregation function is not supported by .pivot(), Polars should raise an exception for all three cases above.

It seems that aggregation with .pivot() is provided as a convenience, but .pivot() only works in eager mode, so it is probably best practice to use .group_by() for as long as possible anyway (including any non-trivial aggregations). If .pivot() will always be too limited to support the simple scenario outlined here, the docs' promise that aggregate_function can take an arbitrary expression is misleading. Perhaps the feature is more liability than convenience. Having duplicated functionality that is incomplete and broken is worse than not having it at all.

So I would argue that Polars should deprecate passing a custom expression as aggregate_function in .pivot() and instead support only the predefined aggregation function strings. The docs should then point users requiring custom aggregation to .group_by().
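
For the predefined aggregations, a string already behaves as expected. A minimal sketch (aggregating the weights rather than the names, purely as an illustration):

# Predefined string aggregation: heaviest weight (not name) per cell.
max_weights = cars.pivot(
    on='origin', index='year', values='weight',
    aggregate_function='max', sort_columns=True
)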

Expected behavior

Polars should raise an exception for all three cases above, or return the correct results for cases 1 and 2.

Installed versions

--------Version info---------
Polars:              1.28.1
Index type:          UInt32
Platform:            macOS-10.16-x86_64-i386-64bit
Python:              3.12.3 | packaged by Anaconda, Inc. | (main, May  6 2024, 14:43:12) [Clang 14.0.6 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  0.11.0
altair               5.5.0
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.0.0
connectorx           0.3.3
deltalake            0.17.4
fastexcel            0.10.4
fsspec               2025.3.0
gevent               24.2.1
google.auth          <not installed>
great_tables         0.17.0
matplotlib           3.10.1
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
polars_cloud         <not installed>
pyarrow              16.1.0
pydantic             2.10.6
pyiceberg            0.6.1
sqlalchemy           2.0.40
torch                <not installed>
xlsx2csv             0.8.2
xlsxwriter           3.2.0
@edschofield edschofield added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 29, 2025
@MarcoGorelli MarcoGorelli added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Apr 29, 2025
@MarcoGorelli
Collaborator

Thanks @edschofield for the report

Simpler reproducer (perhaps):

df = pl.DataFrame({'a': ['x','x','x'], 'b': [1,1,1], 'c': [7,8,9], 'd': [0,2,1]})
df.pivot(on='a',index='b',values='c', aggregate_function=pl.element().sort_by('d'))

Outputs:

shape: (1, 2)
┌─────┬───────────┐
│ b   ┆ x         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [7, 8, 9] │
└─────┴───────────┘

However, sorting column 'c' by column 'd' should output [7, 9, 8]
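
For reference, a sketch of the .group_by() equivalent, which does produce the expected ordering:

# Sorting 'c' by 'd' within the group yields [7, 9, 8]:
df.group_by('b').agg(pl.col('c').sort_by('d'))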

@MarcoGorelli MarcoGorelli added the accepted Ready for implementation label Apr 29, 2025
@kdn36
Contributor

kdn36 commented May 6, 2025

I don't mind giving this a closer look if there are no other takers, but may need some help (new to the code base).

@kdn36
Contributor

kdn36 commented May 7, 2025

Update.

(a) Functions such as sort_by reference another column ("d" in the example). This column gets reset to "" in https://github.com/pola-rs/polars/blob/main/crates/polars-lazy/src/physical_plan/exotic.rs#L7, which is called from pivot_stable() in https://github.com/pola-rs/polars/blob/main/crates/polars-lazy/src/frame/pivot.rs#L84-L88. This explains why these arguments are ignored, even when a non-existent column name is provided.

(b) In addition, the tmp_df dataframe passed to evaluate only contains the values column, so the expression could not be evaluated even if the column names were preserved. A quick modification to the code illustrates this: we now get polars.exceptions.ColumnNotFoundError: "d" not found from evaluate() in impl PhysicalExpr for ColumnExpr in column.rs. To be confirmed.
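
A rough Python-side illustration of (b), assuming tmp_df really does hold only the values column (the frame below is a hypothetical stand-in, not the actual internal code path):

# Hypothetical stand-in for tmp_df: only the values column is present,
# so resolving the by=/sort_by column fails.
tmp_df = pl.DataFrame({'c': [7, 8, 9]})
tmp_df.select(pl.col('c').sort_by('d'))  # raises ColumnNotFoundError: 'd'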

I will see if I can make it work, no promises.

@MarcoGorelli if you have any insights or feedback, that would be appreciated. We can move to a draft PR if you prefer.

@kdn36
Contributor

kdn36 commented May 8, 2025

Small update - with the proposed approach (not production-grade code), I get the expected result, at least for the sort_by MRE:

shape: (1, 2)
┌─────┬───────────┐
│ b   ┆ x         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [7, 9, 8] │
└─────┴───────────┘

Next steps: investigate top_k_by, proper code, PR.

@edschofield
Author

Well done and thanks for your work on this, @kdn36! This looks promising...

@kdn36
Contributor

kdn36 commented May 11, 2025

Update.

In addition to (a) and (b), we have:
(c) The functions sort_by() and top_k_by() are implemented differently: the first is an explicit AExpr::SortBy, while the second is a more generic AExpr::Function variant. As part of evaluating the expression inside PivotExpr, the lazyframe gets optimized against a dummy single-column dataframe based on the values dtype. The more generic Function variant wants to access all referenced columns of the original dataframe as part of this process (and fails because the dummy dataframe does not contain them), while the explicit SortBy variant has what it needs without those lookups.

With a modified version, case 1 now succeeds:

baseline (group_by):
shape: (2, 3)
┌──────┬─────────────┬───────────────────────┐
│ year ┆ EU          ┆ JP                    │
│ ---  ┆ ---         ┆ ---                   │
│ i64  ┆ str         ┆ str                   │
╞══════╪═════════════╪═══════════════════════╡
│ 1970 ┆ peugeot 504 ┆ toyota corona mark ii │
│ 1971 ┆ opel 1900   ┆ toyota corona         │
└──────┴─────────────┴───────────────────────┘

case1:
shape: (2, 3)
┌──────┬─────────────┬───────────────────────┐
│ year ┆ EU          ┆ JP                    │
│ ---  ┆ ---         ┆ ---                   │
│ i64  ┆ str         ┆ str                   │
╞══════╪═════════════╪═══════════════════════╡
│ 1970 ┆ peugeot 504 ┆ toyota corona mark ii │
│ 1971 ┆ opel 1900   ┆ toyota corona         │
└──────┴─────────────┴───────────────────────┘

Case 2 still panics because of the different syntax (it explicitly references the column via pl.col('name')); that is fixable.
Case 3 panics (better than silently ignoring the column), so that needs to be converted to a ColumnNotFoundError; doable.

Two approaches to fix this:

  1. Carry the full dataframe (schema) throughout the planning cycle.
  2. Refine the logic in to_field_impl to only consider the input column when determining the result field of the Function (current modification).

I think 1) is the right thing to do, but I will need core team input. Will shift this discussion to an upcoming PR.
