Test / update for the upcoming pandas 3.0 release

pandas is nearing its 3.0 release, and a release candidate is available for testing. See the announcement: https://pandas.pydata.org/community/blog/pandas-3.0-release-candidate.html

This is quite a big release with some potentially breaking changes, mostly because of a new default string dtype (no longer `object` dtype for strings) and consistent copy/view behaviour with Copy-on-Write (no more `SettingWithCopyWarning`, but chained assignment will also never work). Given that pyjanitor's scope is to enable more method chaining, I assume you won't have issues with that last change.
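To make the two headline changes concrete, here is a minimal sketch that should behave the same on pandas 2.x and 3.0. The frame and column names are made up for illustration and are not taken from pyjanitor.

```python
import pandas as pd

# Hypothetical example frame (names are illustrative only).
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Chained assignment such as df["value"][0] = 10 never updates df under
# Copy-on-Write; a single .loc call does, in both pandas 2.x and 3.0.
df.loc[0, "value"] = 10

# Method-chaining style: build a new frame instead of mutating in place.
doubled = df.assign(value=lambda d: d["value"] * 2)

# Version-independent string check: True for object-dtype strings (2.x)
# and for the new default string dtype (3.0).
name_is_string = pd.api.types.is_string_dtype(df["name"])
```

Code that compares a column's dtype against `object` directly is the part most likely to need attention; a check like `is_string_dtype` avoids depending on which string representation is active.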

It would be great if projects depending on pandas could test the upcoming release, to 1) find issues we can still fix, and 2) if some changes are needed to the code, be prepared for that when 3.0 comes out.

Given this is very useful for pandas, I started doing this myself for pyjanitor and am documenting my findings here (feel free to update this issue description if you want to use it further for tracking compatibility with pandas 3.0).

Steps taken:

1) Run the tests with the latest 2.3 and resolve deprecation warnings -> this is already done on main, and checking the [last run](https://github.com/pyjanitor-devs/pyjanitor/actions/runs/20417163570/job/58662499847) I see two warnings:
   - [ ] `tests/functions/test_convert_unix_date.py::test_convert_unix_date`: The behavior of 'to_datetime' with 'unit' when parsing strings is deprecated. In a future version, strings will be parsed as datetime strings, matching the behavior without a 'unit'. To retain the old behavior, explicitly cast ints or floats to numeric type before calling to_datetime.
   - [ ] `tests/functions/test_coalesce.py::test_coalesce_without_target`:  The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
2) Run the tests with 2.3 but with future options enabled.
    - I only did that for CoW with `PANDAS_COPY_ON_WRITE=warn pixi run -e py312  pytest tests/ --ignore tests/spark`, and checked the few cases where it warned about setting on a view that will no longer propagate in the future, but they were all false positives (i.e. cases where we don't care about that)
    - For the string dtype, you can also directly test with the 3.0 rc now, see below
3) Run the tests with 3.0 RC. I started by doing that locally, and this yields a whole bunch of failures:
   - [x] Upstream regression in boolean filtering, already fixed in the new RC1 (https://github.com/pandas-dev/pandas/issues/63391)
   - [ ] A recently added regex in `clean_names` (https://github.com/pyjanitor-devs/pyjanitor/pull/1538) is causing errors, because lookahead/lookbehind is apparently not supported by the regex engine used by pyarrow. Reported upstream: https://github.com/pandas-dev/pandas/issues/63385
   - [ ] Similarly, usage of `\Z` in regex gives problems, see the above linked issue
   - [ ] `tests/functions/test_data_description.py::test_description_list` et al: this is failing because the accessor relies on being cached, and that behaviour changed. I opened upstream issue https://github.com/pandas-dev/pandas/issues/63393 about this. Feedback would be very welcome
   - [ ] `tests/functions/test_convert_unix_date.py::test_convert_unix_date` failing because of the mentioned deprecation in bullet point 1). If you want to keep this working (although it is deprecated), you have to cast the column to integer before passing to `to_datetime` (and otherwise it requires a test update to remove the strings from the test data)
   - [ ] I get failures in `tests/functions/test_conditional_join.py` because hypothesis creates string data that consists of invalid unicode, which is no longer supported in pandas (unless you explicitly specify `dtype=object`). I opened an upstream issue to ask whether we should do that fallback automatically: https://github.com/pandas-dev/pandas/issues/63396
   - [ ] `tests/functions/test_concatenate_columns.py::test_concatenate_columns_null_values` fails because `df[column_names].astype(str).fillna("")` no longer converts missing values to the string "nan" in the `astype` step, so they now actually get filled with "". This seems a "good" change, but requires a test update to make the test pass
   - [ ] `tests/functions/test_complete.py::test_fill_value_scalar`: there is an explicit `astype(object)` to construct the expected result, but the actual result now has string dtype, causing `assert_frame_equal` to fail (I assume just removing the astype should fix it). See the [migration guide section](https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html#the-dtype-is-no-longer-a-numpy-object-dtype)
     - Similar for failures in `tests/functions/test_move.py::test_move_source_target_seq/test_move_source_target_seq_after` 
   - [ ] `tests/io/test_tidyxl.py::test_default_values_blank_cells_true`: there is an explicit check for `None`, but the new string dtype now uses `np.nan` as missing value indicator (see the [migration guide](https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html))
   - [ ] `tests/functions/test_to_datetime.py::test_to_datetime` / `tests/functions/test_truncate_datetime.py::test_truncate_datetime_containing_NaT`: hardcoded check for ns unit, but pandas 3.0 will now default to microseconds (needs a test update)
   - [ ] `tests/functions/test_pivot_longer.py::test_pivot_sort_by_appearance` and `tests/functions/test_sort_column_value_order.py::test_sort_column_value_order`: still have to figure out why sorting order might have changed (EDIT: this is now marked as xfailed after I fetched the latest changes from `dev`)
   - [ ] `tests/functions/test_select_columns.py::test_select_groupby`: the creation of the expected result with `dataframe.select_dtypes("number").groupby(dataframe["a"]).sum()` changed behaviour because of Copy-on-Write: `select_dtypes` now returns a shallow copy and not a hard copy, resulting in `groupby` seeing `dataframe["a"]` as a column of the calling dataframe, and therefore dropping it from the result (as it is already set as index)
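A few of the adjustments mentioned in the list above could look roughly like the sketch below. It is written to run on both pandas 2.x and 3.0; the data and variable names are hypothetical and do not come from the pyjanitor test suite, and whether pyjanitor should adopt any of these patterns is for the maintainers to decide.

```python
import pandas as pd

# Hypothetical column mixing numeric strings and ints (illustrative only).
raw = pd.Series(["0", 86400], dtype=object)

# Cast to integer first, so to_datetime(unit=...) never has to parse strings
# and the deprecated string-with-unit path is avoided.
dates = pd.to_datetime(raw.astype("int64"), unit="s")

# Unit-agnostic dtype check: passes whether the result is ns, us or s,
# unlike a hardcoded comparison against "datetime64[ns]".
is_datetime = pd.api.types.is_datetime64_any_dtype(dates)

# Filling missing values *before* casting to str yields "" for missing
# entries regardless of how astype(str) treats them in a given version.
df = pd.DataFrame({"a": ["x", None], "b": ["1", "2"]})
filled = df[["a", "b"]].fillna("").astype(str)
```

The common theme is to avoid assertions that depend on a specific resolution or on how missing values are stringified, since both differ between 2.x and 3.0.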



