`df.pivot()` gives incorrect results with a custom aggregation function #22479
Comments
Thanks @edschofield for the report. Simpler reproducer (perhaps):

```python
df = pl.DataFrame({'a': ['x', 'x', 'x'], 'b': [1, 1, 1], 'c': [7, 8, 9], 'd': [0, 2, 1]})
df.pivot(on='a', index='b', values='c', aggregate_function=pl.element().sort_by('d'))
```

Outputs:

```
shape: (1, 2)
┌─────┬───────────┐
│ b   ┆ x         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [7, 8, 9] │
└─────┴───────────┘
```

However, sorting column […]
I don't mind giving this a closer look if there are no other takers, but may need some help (new to the code base).
Update.

(a) Functions like […]

(b) In addition, the […]

I will see if I can make it work, no promises. @MarcoGorelli, if you have any insights or feedback, that would be appreciated. We can move to a draft PR if you prefer.
Small update: with the proposed approach (not production-grade code), I get the expected result, at least for the […]

Next steps: investigate […]
Well done and thanks for your work on this, @kdn36! This looks promising... |
Update. In addition to (a) and (b), we have: […]

With a modified version, case 1 now succeeds. Case 2 still panics because of the different syntax (it explicitly references the column in […]). Two approaches to fix this: […]

I think 1) is the right thing to do, but I will need core team input. Will shift this discussion to an upcoming PR.
Checks
Reproducible example
Consider this toy dataset: […]

Suppose we want a pivot table containing the heaviest car for each year and origin. This code achieves it by using `.group_by()` with a custom aggregation function, then a simple `.pivot()`: […]

Correct result: […]
Question
Can we now go directly to this result, without using `.group_by()`, by passing a custom aggregation expression to `.pivot()`? The docs for `pivot()` imply it's possible to pass a custom aggregation expression using `pl.element()`.

Case 1
This code doesn't raise an exception but instead produces an incorrect result: […]

Incorrect result: […]
Case 2
The "obvious" solution also doesn't raise an exception and instead produces the same incorrect result:
Case 3
Note that the `by=` argument to `.top_k_by()` is being completely ignored: the column passed as `by=...` is ignored when the custom expression is passed as an `aggregate_function` to `.pivot()`, even if the column is nonexistent. (Note that, unlike `.pivot()`, `.group_by()` raises a `ColumnNotFoundError` in these cases, as expected.)

At least these expression methods are affected when used with `.pivot()`:

- `.top_k_by()`
- `.bottom_k_by()`
- `.sort_by()`
Issue description
If this kind of aggregation function is not supported by `.pivot()`, Polars should raise an exception for all three cases above.

It seems that aggregation with `.pivot()` is provided as a convenience, but `.pivot()` only works in eager mode, so it is probably best practice to use `.group_by()` for as long as possible anyway (including for any non-trivial aggregations). If `.pivot()` will always be too limited to support the simple scenario outlined here, the docs' promise that `aggregate_function` can take an arbitrary expression is misleading. Perhaps the feature is more liability than convenience: having duplicated functionality that is incomplete and broken is worse than not having it at all.

So I would argue that Polars should deprecate passing a custom expression as `aggregate_function` in `.pivot()` and instead support only the predefined aggregation function strings. The docs should then point users requiring custom aggregation to `.group_by()`.

Expected behavior
Polars should raise an exception for all three cases above, or return the correct results for cases 1 and 2.
Installed versions