This adds a coalesce_keys keyword to DataFrame.join to allow
preservation of both join key columns (id and id_right),
instead of automatically coalescing them into a single column.
This is especially useful in full outer joins, where retaining
information about unmatched keys from both sides is important.
Example:
df1.join(df2, on="id", coalesce_keys=False)
This will result in both id and id_right columns being preserved,
rather than merged into a single id.
Includes:
- Modifications to join internals (core/reshape/merge.py)
- A dedicated test file (test_merge_coalesce.py) covering:
  - Preservation of join keys when coalesce_keys=False
  - Comparison with default behavior (coalesce_keys=True)
  - Full outer joins with asymmetric key presence
Co-authored-by: Maria Pereira <[email protected]>
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
It would be useful to retain keys used in a join instead of automatically coalescing them. This is most useful in full outer joins. I am happy to implement myself :)
Feature Description
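A test for this would pass with the data below.
import pandas as pd
df1 = pd.DataFrame({"id": [1, 2, 3], "value1": ["A", "B", "C"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "value2": ["X", "Y", "Z"]})
# coalesce_keys is the proposed keyword; it does not exist in pandas today.
# how="outer" is needed for the unmatched rows shown in the expected result.
res = df1.join(df2, on="id", how="outer", coalesce_keys=False)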
Note the preservation of the id columns:
expected_no_coalesce = {
    "id": [None, 1, 2, 3],
    "value1": [None, "A", "B", "C"],
    "id_right": [4, None, 2, 3],
    "value2": ["Z", None, "X", "Y"],
}
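For comparison, a similar result can be approximated today with the existing merge API by renaming the right-hand key before a full outer merge. This is only a sketch of a workaround, not the proposed implementation: the `id_right` name simply follows the example in this issue, and row order and key dtypes (the keys become floats once missing values appear) may differ from the expected output shown here.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "value1": ["A", "B", "C"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "value2": ["X", "Y", "Z"]})

# Default behavior: the shared key is coalesced into a single "id" column.
coalesced = df1.merge(df2, on="id", how="outer")

# Workaround: give the right-hand key a distinct name so both key columns
# survive the outer merge. "id_right" is an illustrative name taken from
# the issue text, not something pandas generates here.
right = df2.rename(columns={"id": "id_right"})
preserved = df1.merge(right, how="outer", left_on="id", right_on="id_right")

print(coalesced)
print(preserved)
```

In `preserved`, the row that exists only in `df2` has a missing `id`, and the row that exists only in `df1` has a missing `id_right`, which is the information a coalesced key column discards.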
Alternative Solutions
Arrow and Polars have this option. I bring this up because I'm implementing a common full join where keys are preserved in the Narwhals package, and I noticed that pandas does not allow this out of the box. https://github.com/narwhals-dev/narwhals/pull/2126/files#diff-ff8314856956318d0da461d7cc2710a6b18d3c052581be7990ae0023a9e689ee
Additional Context
No response