-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Raise clear error for duplicate id_vars in melt (GH61475) #61484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
pandas/core/reshape/melt.py
Outdated
# GH61475 - prevent AttributeError when duplicate column | ||
if not hasattr(id_data, "dtype"): | ||
raise Exception(f"{col} is a duplicate column header") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This should check
if not frame.columns.is_unique
at the beginning of the function. - A
ValueError
is more appropriate here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I've moved the check for not frame.columns.is_unique
to the beginning of the function and updated the exception type to ValueError
as suggested.
A quick clarification question: currently melt
allows duplicate column names in 'value_vars', as seen in the test test_melt_with_duplicate_columns
.With this change, are we treating any duplicate columns in the input DataFrame
as a ValueError
? Not just when the duplicates appear in id_vars
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point. I guess this specifically when id_vars
is not empty we'll want to raise if not frame.columns.is_unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Just to make sure I understand:
You're saying that we should only raise an error in duplicate column names if the duplicate is in id_vars
, correct? Essentially if the duplicates are in value_vars
then, we can let the melt function work as is, as long as no errors are occuring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated logic to raise a ValueError
when there are duplicates columns in id_vars
and not in value_vars
, as discussed and moved it to the beginning of the melt function.
I considered using the frames.columns.is_unique
, but that would raise for any duplicate name - even in value_vars
which is supported. To keep the behavior consistent, I checked only for duplicate column names referenced in id_vars
. Let me know if further adjustment is needed!
pandas/core/reshape/melt.py
Outdated
@@ -173,6 +173,9 @@ def melt( | |||
1 b B E 3 | |||
2 c B E 5 | |||
""" | |||
# GH61475 - prevent AttributeError when duplicate column in id_vars | |||
if id_vars and any(frame.columns.tolist().count(col) > 1 for col in id_vars): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you move this after the
id_vars = ensure_list_vars(id_vars, "id_vars", frame.columns)
line - Specify this as
if id_vars and not frame.columns.is_unique and any(id_var in frame.column for id_var in id_vars):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I added not frame.columns.is_unique
at the beginning of the function, wouldn't it cause test_melt_with_duplicate_columns
to fail, since that test currently allows duplicate column names in value_vars
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. You'll need to run the test to find out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added not frame.columns.is_unique
at and it caused test_melt_with_duplicate_columns
to fail since that test currently allows duplicate column names in value_vars
, so we might need to handle duplicates differently since we allow certain duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok you'll need to check that the id_var
values is actually a duplicate in the columns.
You can use len(frame.columns.get_indexer_for(id_vars)) > len(id_vars)
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.