Skip to content

df.pivot() gives incorrect results with a custom aggregation function #22479

@edschofield

Description

@edschofield

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Consider this toy dataset:

import polars as pl

cars = pl.DataFrame([
    {'year': 1970, 'origin': 'EU', 'name': 'volkswagen 1131 deluxe sedan', 'weight': 1835},
    {'year': 1970, 'origin': 'EU', 'name': 'peugeot 504', 'weight': 2672},
    {'year': 1970, 'origin': 'JP', 'name': 'toyota corona mark ii', 'weight': 2372},
    {'year': 1970, 'origin': 'JP', 'name': 'datsun pl510', 'weight': 2130},
    {'year': 1971, 'origin': 'EU', 'name': 'opel 1900', 'weight': 2123},
    {'year': 1971, 'origin': 'EU', 'name': 'peugeot 304', 'weight': 2074},
    {'year': 1971, 'origin': 'JP', 'name': 'datsun pl510', 'weight': 2130},
    {'year': 1971, 'origin': 'JP', 'name': 'toyota corona', 'weight': 2228}
])

Suppose we want a pivot table containing the heaviest car for each year and origin.

This code achieves this by using .group_by() with a custom aggregation function, then a simple .pivot():

heaviest_car_expr = pl.col('name').top_k_by(k=1, by='weight').first().alias('heaviest car')

heaviest_cars = cars.group_by(
    'year', 'origin'
).agg(
    heaviest_car_expr
).sort('year', 'origin').pivot(
    on='origin', index='year'
)

Correct result

year EU JP
1970 peugeot 504 toyota corona mark ii
1971 opel 1900 toyota corona

Question

Can we now go directly to this result, without using .group_by(), by passing a custom aggregation expression to .pivot()?

The docs for pivot() imply it's possible to pass a custom aggregation expression using pl.element().

Case 1

This code doesn't raise an exception but instead produces an incorrect result:

incorrect_heaviest_cars_1 = cars.pivot(
    on='origin', index='year', values='name', aggregate_function=pl.element().top_k_by(by='weight', k=1).first(), sort_columns=True
)

Incorrect result

year EU JP
1970 volkswagen 1131 deluxe sedan toyota corona mark ii
1971 peugeot 304 toyota corona

Case 2

The "obvious" solution also doesn't raise an exception and instead produces the same incorrect result:

incorrect_heaviest_cars_2 = cars.pivot(
    on='origin', index='year', values='name', aggregate_function=heaviest_car_expr, sort_columns=True
)

Case 3

Note that the by= argument to .top_k_by() is being completely ignored:

broken_heaviest_car_expr = pl.element().top_k_by(by='THIS COLUMN DOES NOT EXIST', k=1).first()
incorrect_heaviest_cars_3 = cars.pivot(
    on='origin', index='year', values='name', aggregate_function=broken_heaviest_car_expr, sort_columns=True
)

The column passed as by=... is ignored when the custom expression is passed as an aggregate_function to .pivot(), even if the column is nonexistent.

(Note that, unlike .pivot(), .group_by() raises a ColumnNotFoundError in these cases as expected.)

At least these expression methods are affected when used with .pivot():

  • .top_k_by()
  • .bottom_k_by()
  • .sort_by()

Issue description

If this kind of aggregation function is not supported by .pivot(), Polars should raise an exception for all three cases above.

It seems that aggregation with .pivot() is provided as a convenience, but .pivot() only works in eager mode, so it is probably best practice to use .group_by() for as long as possible anyway (including any non-trivial aggregations). If .pivot() will always be too limited to support the simple scenario outlined here, the docs' promise that aggregate_function can take an arbitrary expression is misleading. Perhaps the feature is more liability than convenience. Having duplicated functionality that is incomplete and broken is worse than not having it at all.

So I would argue that Polars should deprecate passing a custom expression as aggregate_function in .pivot() and instead support only the predefined aggregation function strings. The docs should then point users requiring custom aggregation to .group_by().

Expected behavior

Polars should raise an exception for all three cases above, or return the correct results for cases 1 and 2.

Installed versions

Details
--------Version info---------
Polars:              1.28.1
Index type:          UInt32
Platform:            macOS-10.16-x86_64-i386-64bit
Python:              3.12.3 | packaged by Anaconda, Inc. | (main, May  6 2024, 14:43:12) [Clang 14.0.6 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  0.11.0
altair               5.5.0
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.0.0
connectorx           0.3.3
deltalake            0.17.4
fastexcel            0.10.4
fsspec               2025.3.0
gevent               24.2.1
google.auth          <not installed>
great_tables         0.17.0
matplotlib           3.10.1
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
polars_cloud         <not installed>
pyarrow              16.1.0
pydantic             2.10.6
pyiceberg            0.6.1
sqlalchemy           2.0.40
torch                <not installed>
xlsx2csv             0.8.2
xlsxwriter           3.2.0

Metadata

Metadata

Assignees

Labels

P-highPriority: highacceptedReady for implementationbugSomething isn't workingpythonRelated to Python Polars

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions