
df.pivot() gives incorrect results with a custom aggregation function #22479


Open
edschofield opened this issue Apr 29, 2025 · 6 comments · May be fixed by #22718
Labels
accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@edschofield

edschofield commented Apr 29, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Consider this toy dataset:

import polars as pl

cars = pl.DataFrame([
    {'year': 1970, 'origin': 'EU', 'name': 'volkswagen 1131 deluxe sedan', 'weight': 1835},
    {'year': 1970, 'origin': 'EU', 'name': 'peugeot 504', 'weight': 2672},
    {'year': 1970, 'origin': 'JP', 'name': 'toyota corona mark ii', 'weight': 2372},
    {'year': 1970, 'origin': 'JP', 'name': 'datsun pl510', 'weight': 2130},
    {'year': 1971, 'origin': 'EU', 'name': 'opel 1900', 'weight': 2123},
    {'year': 1971, 'origin': 'EU', 'name': 'peugeot 304', 'weight': 2074},
    {'year': 1971, 'origin': 'JP', 'name': 'datsun pl510', 'weight': 2130},
    {'year': 1971, 'origin': 'JP', 'name': 'toyota corona', 'weight': 2228}
])

Suppose we want a pivot table containing the heaviest car for each year and origin.

This code does so by using .group_by() with a custom aggregation expression, then a simple .pivot():

heaviest_car_expr = pl.col('name').top_k_by(k=1, by='weight').first().alias('heaviest car')

heaviest_cars = cars.group_by(
    'year', 'origin'
).agg(
    heaviest_car_expr
).sort('year', 'origin').pivot(
    on='origin', index='year'
)

Correct result

year  EU           JP
1970  peugeot 504  toyota corona mark ii
1971  opel 1900    toyota corona

Question

Can we now go directly to this result, without using .group_by(), by passing a custom aggregation expression to .pivot()?

The docs for pivot() imply it's possible to pass a custom aggregation expression using pl.element().
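
For context, a self-contained reduction over pl.element() (one that does not reference any other column) is the kind of usage the docs illustrate. A minimal sketch, using .max() purely as an illustration:

# Illustration only: a reduction that touches just the values column
# appears to behave as documented (it picks the alphabetically last
# name per cell here, not the heaviest car).
simple_pivot = cars.pivot(
    on='origin', index='year', values='name',
    aggregate_function=pl.element().max(), sort_columns=True
)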

Case 1

This code doesn't raise an exception but instead produces an incorrect result:

incorrect_heaviest_cars_1 = cars.pivot(
    on='origin', index='year', values='name',
    aggregate_function=pl.element().top_k_by(by='weight', k=1).first(),
    sort_columns=True
)

Incorrect result

year  EU                            JP
1970  volkswagen 1131 deluxe sedan  toyota corona mark ii
1971  peugeot 304                   toyota corona

Case 2

The "obvious" solution also doesn't raise an exception and instead produces the same incorrect result:

incorrect_heaviest_cars_2 = cars.pivot(
    on='origin', index='year', values='name', aggregate_function=heaviest_car_expr, sort_columns=True
)

Case 3

Note that the by= argument to .top_k_by() is being completely ignored:

broken_heaviest_car_expr = pl.element().top_k_by(by='THIS COLUMN DOES NOT EXIST', k=1).first()
incorrect_heaviest_cars_3 = cars.pivot(
    on='origin', index='year', values='name', aggregate_function=broken_heaviest_car_expr, sort_columns=True
)

The column passed as by=... is ignored when the custom expression is passed as an aggregate_function to .pivot(), even if the column is nonexistent.

(Note that, unlike .pivot(), .group_by() raises a ColumnNotFoundError in these cases as expected.)
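
For comparison, a minimal sketch of the same nonexistent-column expression under .group_by(), which fails loudly:

# Unlike .pivot(), .group_by() resolves the by= column and raises
# polars.exceptions.ColumnNotFoundError here:
cars.group_by('year', 'origin').agg(
    pl.col('name').top_k_by(by='THIS COLUMN DOES NOT EXIST', k=1).first()
)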

At least these expression methods are affected when used with .pivot():

  • .top_k_by()
  • .bottom_k_by()
  • .sort_by()

Issue description

If this kind of aggregation function is not supported by .pivot(), Polars should raise an exception for all three cases above.

It seems that aggregation with .pivot() is provided as a convenience, but .pivot() only works in eager mode, so it is probably best practice to use .group_by() for as long as possible anyway (including any non-trivial aggregations). If .pivot() will always be too limited to support the simple scenario outlined here, the docs' promise that aggregate_function can take an arbitrary expression is misleading. Perhaps the feature is more liability than convenience. Having duplicated functionality that is incomplete and broken is worse than not having it at all.

So I would argue that Polars should deprecate passing a custom expression as aggregate_function in .pivot() and instead support only the predefined aggregation function strings. The docs should then point users requiring custom aggregation to .group_by().
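
For the predefined aggregations, a string already behaves as expected. A minimal sketch (aggregating the weights rather than the names, purely as an illustration):

# Predefined string aggregation: heaviest weight (not name) per cell.
max_weights = cars.pivot(
    on='origin', index='year', values='weight',
    aggregate_function='max', sort_columns=True
)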

Expected behavior

Polars should raise an exception for all three cases above, or return the correct results for cases 1 and 2.

Installed versions

--------Version info---------
Polars:              1.28.1
Index type:          UInt32
Platform:            macOS-10.16-x86_64-i386-64bit
Python:              3.12.3 | packaged by Anaconda, Inc. | (main, May  6 2024, 14:43:12) [Clang 14.0.6 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  0.11.0
altair               5.5.0
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.0.0
connectorx           0.3.3
deltalake            0.17.4
fastexcel            0.10.4
fsspec               2025.3.0
gevent               24.2.1
google.auth          <not installed>
great_tables         0.17.0
matplotlib           3.10.1
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
polars_cloud         <not installed>
pyarrow              16.1.0
pydantic             2.10.6
pyiceberg            0.6.1
sqlalchemy           2.0.40
torch                <not installed>
xlsx2csv             0.8.2
xlsxwriter           3.2.0
@edschofield edschofield added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 29, 2025
@MarcoGorelli MarcoGorelli added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Apr 29, 2025
@MarcoGorelli
Collaborator

Thanks @edschofield for the report

Simpler reproducer (perhaps):

df = pl.DataFrame({'a': ['x','x','x'], 'b': [1,1,1], 'c': [7,8,9], 'd': [0,2,1]})
df.pivot(on='a',index='b',values='c', aggregate_function=pl.element().sort_by('d'))

Outputs:

shape: (1, 2)
┌─────┬───────────┐
│ b   ┆ x         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [7, 8, 9] │
└─────┴───────────┘

However, sorting column 'c' by column 'd' should output [7, 9, 8]
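
For reference, a sketch of the .group_by() equivalent, which does produce the expected ordering:

# Sorting 'c' by 'd' within the group yields [7, 9, 8]:
df.group_by('b').agg(pl.col('c').sort_by('d'))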

@MarcoGorelli MarcoGorelli added the accepted Ready for implementation label Apr 29, 2025
@kdn36
Contributor

kdn36 commented May 6, 2025

I don't mind giving this a closer look if there are no other takers, but may need some help (new to the code base).

@kdn36
Contributor

kdn36 commented May 7, 2025

Update.

(a) Functions such as sort_by reference another column ("d" in the example). This column gets reset to "" in https://github.com/pola-rs/polars/blob/main/crates/polars-lazy/src/physical_plan/exotic.rs#L7, which is called from pivot_stable() in https://github.com/pola-rs/polars/blob/main/crates/polars-lazy/src/frame/pivot.rs#L84-L88. This explains why these arguments are ignored, even when a non-existent column name is provided.

(b) In addition, the tmp_df dataframe passed to evaluate only contains the values column, so the expression could not be evaluated even if the column names were preserved. A quick modification to the code illustrates this: we now get polars.exceptions.ColumnNotFoundError: "d" not found from evaluate() in impl PhysicalExpr for ColumnExpr in column.rs. To be confirmed.
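
A rough Python-side illustration of (b), assuming tmp_df really does hold only the values column (the frame below is a hypothetical stand-in, not the actual internal code path):

# Hypothetical stand-in for tmp_df: only the values column is present,
# so resolving the by=/sort_by column fails.
tmp_df = pl.DataFrame({'c': [7, 8, 9]})
tmp_df.select(pl.col('c').sort_by('d'))  # raises ColumnNotFoundError: 'd'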

I will see if I can make it work, no promises.

@MarcoGorelli if you have any insights or feedback, that would be appreciated. We can move to a draft PR if you prefer.

@kdn36
Contributor

kdn36 commented May 8, 2025

Small update - with the proposed approach (not production-grade code), I get the expected result, at least for the sort_by MRE:

shape: (1, 2)
┌─────┬───────────┐
│ b   ┆ x         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [7, 9, 8] │
└─────┴───────────┘

Next steps: investigate top_k_by, proper code, PR.

@edschofield
Author

Well done and thanks for your work on this, @kdn36! This looks promising...

@kdn36
Contributor

kdn36 commented May 11, 2025

Update.

In addition to (a) and (b), we have:
(c) The functions sort_by() and top_k_by() are implemented differently: the first is an explicit AExpr::SortBy, while the second is a more generic AExpr::Function variant. As part of evaluating the expression inside PivotExpr, the lazyframe gets optimized against a dummy single-column dataframe based on the values dtype. The more generic Function variant wants to access all referenced columns of the original dataframe as part of this process (and fails because the dummy dataframe does not contain them), while the explicit SortBy variant has what it needs without those lookups.

With a modified version, case 1 now succeeds:

baseline (group_by):
shape: (2, 3)
┌──────┬─────────────┬───────────────────────┐
│ year ┆ EU          ┆ JP                    │
│ ---  ┆ ---         ┆ ---                   │
│ i64  ┆ str         ┆ str                   │
╞══════╪═════════════╪═══════════════════════╡
│ 1970 ┆ peugeot 504 ┆ toyota corona mark ii │
│ 1971 ┆ opel 1900   ┆ toyota corona         │
└──────┴─────────────┴───────────────────────┘

case1:
shape: (2, 3)
┌──────┬─────────────┬───────────────────────┐
│ year ┆ EU          ┆ JP                    │
│ ---  ┆ ---         ┆ ---                   │
│ i64  ┆ str         ┆ str                   │
╞══════╪═════════════╪═══════════════════════╡
│ 1970 ┆ peugeot 504 ┆ toyota corona mark ii │
│ 1971 ┆ opel 1900   ┆ toyota corona         │
└──────┴─────────────┴───────────────────────┘

Case 2 still panics because of the different syntax (it explicitly references the column via pl.col('name')); that is fixable.
Case 3 panics (better than silently ignoring the column), so that needs to be converted to a ColumnNotFoundError; doable.

Two approaches to fix this:

  1. Carry the full dataframe (schema) throughout the planning cycle.
  2. Refine the logic in to_field_impl to only consider the input column when determining the result field of the Function (current modification).

I think 1) is the right thing to do, but I will need core team input. Will shift this discussion to an upcoming PR.
