Description
pandas is nearing its 3.0 release, and a release candidate is available for testing; see the announcement: https://pandas.pydata.org/community/blog/pandas-3.0-release-candidate.html
This is a fairly big release with some potentially breaking changes, mostly because of a new default string dtype (strings no longer use object dtype) and consistent copy/view behaviour with Copy-on-Write (no more SettingWithCopyWarnings, but chained assignment will also never work). Given that pyjanitor's scope is to enable more method chaining, I assume you won't have issues with that last change.
It would be great if projects depending on pandas could test the upcoming release, to 1) find issues we can still fix, and 2) if some changes are needed to the code, be prepared for that when 3.0 comes out.
Given this is very useful for pandas, I started doing this myself for pyjanitor and am documenting my findings here (feel free to update this issue description if you want to keep using it for tracking compatibility with pandas 3.0).
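For reference, a minimal sketch of the chained-assignment change mentioned above. The DataFrame and column names are made up for illustration and are not taken from pyjanitor. Under Copy-on-Write, a chained assignment such as `df["b"][df["a"] > 1] = 0` no longer updates `df`, whereas a single `.loc` call or a method-chaining `assign` should behave the same on 2.x and 3.0:

```python
import pandas as pd

# Hypothetical example data, for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# In-place update with a single indexing step (not chained),
# so it modifies df itself under Copy-on-Write.
df.loc[df["a"] > 1, "b"] = 0

# Method-chaining alternative: build a new DataFrame instead of mutating.
original = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
updated = original.assign(b=lambda d: d["b"].where(d["a"] <= 1, 0))
```

The `assign` variant leaves `original` untouched, which fits the method-chaining style.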
Steps taken:
- Run the tests with the latest 2.3 and resolve deprecation warnings -> this is already done on main, and checking the last run I see two warnings:
  - `tests/functions/test_convert_unix_date.py::test_convert_unix_date`: The behavior of 'to_datetime' with 'unit' when parsing strings is deprecated. In a future version, strings will be parsed as datetime strings, matching the behavior without a 'unit'. To retain the old behavior, explicitly cast ints or floats to numeric type before calling to_datetime.
  - `tests/functions/test_coalesce.py::test_coalesce_without_target`: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
- Run the tests with 2.3 but with future options enabled.
  - I only did that for CoW with `PANDAS_COPY_ON_WRITE=warn pixi run -e py312 pytest tests/ --ignore tests/spark`, and checked the few cases where it warned about setting on a view that will no longer propagate in the future; these were all false positives (i.e. cases where we don't care about that).
  - For the string dtype, you can also test directly with the 3.0 RC now, see below.
- Run the tests with 3.0 RC. Started with doing that locally, and this yields a whole bunch of failures:
  - Upstream regression in boolean filtering, already fixed in the new RC1 (REGR: RangeIndex __getitem__ filtering with boolean extension array of length 1 pandas-dev/pandas#63391)
  - A recently added regex in `clean_names` ([ENH]/[TST] Issue #1257 fix #1538) is causing errors, because lookahead/lookbehind is apparently not supported by the regex engine used by pyarrow. Reported upstream: BUG: string replace results in invalid regular expression: invalid perl operator: (?<= pandas-dev/pandas#63385
  - Similarly, usage of `\Z` in a regex gives problems, see the issue linked above
  - `tests/functions/test_data_description.py::test_description_list` et al: these fail because the accessor relies on being cached, which changed. I opened upstream issue REGR: registered accessors are no longer cached per DataFrame pandas-dev/pandas#63393 about this. Feedback would be very welcome
  - `tests/functions/test_convert_unix_date.py::test_convert_unix_date` fails because of the deprecation mentioned in bullet point 1). If you want to keep this working (although it is deprecated), you have to cast the column to integer before passing it to `to_datetime` (otherwise it requires a test update to remove the strings from the test data)
  - I get failures in `tests/functions/test_conditional_join.py` because hypothesis creates string data that contains invalid unicode, which is no longer supported in pandas (unless you explicitly specify `dtype=object`). Opened an upstream issue to ask whether we should do that fallback automatically: Constructor with invalid unicode: automatically fall back to object dtype? pandas-dev/pandas#63396
  - `tests/functions/test_concatenate_columns.py::test_concatenate_columns_null_values` fails because `df[column_names].astype(str).fillna("")` no longer converts missing values to the string "nan" in the `astype` step, so they now actually get filled with "". This seems a "good" change, but requires a test update to make the test pass
  - `tests/functions/test_complete.py::test_fill_value_scalar`: there is an explicit `astype(object)` to construct the expected result, but the actual result now has string dtype, causing `assert_frame_equal` to fail (I assume just removing the astype should fix it). See the migration guide section
    - Similar for failures in `tests/functions/test_move.py::test_move_source_target_seq` / `test_move_source_target_seq_after`
  - `tests/io/test_tidyxl.py::test_default_values_blank_cells_true`: there is an explicit check for `None`, but the new string dtype now uses `np.nan` as missing value indicator (see the migration guide section)
  - `tests/functions/test_to_datetime.py::test_to_datetime` / `tests/functions/test_truncate_datetime.py::test_truncate_datetime_containing_NaT`: hardcoded check for ns unit, but pandas 3.0 will now default to microseconds (needs a test update)
  - `tests/functions/test_pivot_longer.py::test_pivot_sort_by_appearance` and `tests/functions/test_sort_column_value_order.py::test_sort_column_value_order`: still have to figure out why the sorting order might have changed (EDIT: this is now marked as xfailed after I fetched the latest changes from `dev`)
  - `tests/functions/test_select_columns.py::test_select_groupby`: the creation of the expected result with `dataframe.select_dtypes("number").groupby(dataframe["a"]).sum()` changed behaviour because of Copy-on-Write: `select_dtypes` now returns a shallow copy and not a hard copy, resulting in `groupby` seeing `dataframe["a"]` as a column of the calling dataframe, and therefore dropping it from the result (as it is already set as index)
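
For the two datetime-related failures (`test_convert_unix_date` and the hardcoded ns unit checks), a version-tolerant pattern could look roughly like the sketch below. The sample data is hypothetical and not the actual pyjanitor test data; the idea is to cast to a numeric type before calling `to_datetime` with `unit`, and to normalise the resolution with `as_unit` before comparing, rather than asserting on `datetime64[ns]`:

```python
import pandas as pd

# Hypothetical sample data: Unix timestamps (seconds) stored as strings.
raw = pd.Series(["0", "86400"])

# Cast to a numeric type first, so that unit="s" is applied to numbers
# rather than to strings (the string path is what is deprecated).
seconds = pd.to_numeric(raw)
converted = pd.to_datetime(seconds, unit="s")

# Build the expected result without assuming a specific resolution.
expected = pd.Series(pd.to_datetime(["1970-01-01", "1970-01-02"]))

# Normalise both sides to the same unit before comparing, instead of
# hardcoding "datetime64[ns]" in the assertion.
same_values = converted.dt.as_unit("ns").equals(expected.dt.as_unit("ns"))

# Dtype check that does not depend on the resolution.
is_datetime = pd.api.types.is_datetime64_any_dtype(converted)
```

Whether pyjanitor should cast inside `convert_unix_date` or only adjust the test data is a separate design decision; the sketch only shows that the cast avoids the deprecated code path.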
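
For the string dtype related failures (missing values in `test_concatenate_columns_null_values` and the explicit `None` check in `test_default_values_blank_cells_true`), one possible way to write checks that do not depend on the pandas version is sketched below. The column is made up for illustration. It fills missing values before casting to `str`, and uses `pd.isna` instead of comparing against `None`:

```python
import pandas as pd

# Hypothetical sample column with one missing entry.
names = pd.Series(["a", None, "c"])

# Fill missing values *before* casting to str, so the result does not
# depend on whether astype(str) turns them into the string "nan".
filled = names.fillna("").astype(str)

# pd.isna is True for both None and np.nan, so this check does not care
# which missing value indicator the string dtype uses.
missing_flags = [bool(pd.isna(value)) for value in names]

# Compare values rather than a hardcoded dtype.
values_match = filled.tolist() == ["a", "", "c"]
```

This is only one option; updating the tests to expect the new string dtype directly would also work once 3.0 is the minimum supported version.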
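
Until the upstream regex issue is resolved, a possible workaround for the lookbehind failure in `clean_names` is to cast to object dtype before the `.str.replace` call, so that the replacement goes through Python's `re` module rather than pyarrow's regex engine. This is a sketch with a made-up pattern and data, not the actual `clean_names` regex:

```python
import pandas as pd

# Hypothetical column labels, for illustration only.
labels = pd.Series(["aB", "cB"])

# With object dtype, .str.replace uses Python's `re`, which supports
# lookbehind assertions such as (?<=a).
replaced = labels.astype(object).str.replace(r"(?<=a)B", "_", regex=True)
```

The cast costs an extra copy, so it is probably only worth it for small inputs such as column labels.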