Test / update for the upcoming pandas 3.0 release #1546

@jorisvandenbossche

pandas is nearing a 3.0 release, with a release candidate being available for testing, see the announcement: https://pandas.pydata.org/community/blog/pandas-3.0-release-candidate.html

This is quite a big release with some potentially breaking changes, mostly because of a new default string dtype (no longer object dtype for strings) and consistent copy/view behaviour with Copy-on-Write (no more SettingWithCopyWarnings, but chained assignment will also never work). Given that pyjanitor's scope is to enable more method chaining, I assume you won't have issues with that last change.
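
For readers less familiar with the chained-assignment change, here is a minimal sketch of the pattern that stops working and a replacement that behaves the same on pandas 2.x and 3.0. The frame and column names are made up for illustration and do not come from pyjanitor.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained assignment writes into the intermediate object returned by
# df["a"], so under Copy-on-Write the update never reaches df.
# Pattern to avoid:
#     df["a"][df["b"] > 4] = 0

# A single .loc call updates df itself, with or without Copy-on-Write.
df.loc[df["b"] > 4, "a"] = 0
print(df["a"].tolist())
```

Method chains that return new objects (the style pyjanitor encourages) are not affected, since they never rely on writing through an intermediate view.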

It would be great if projects depending on pandas could test the upcoming release, to 1) find issues we can still fix, and 2) if some changes are needed to the code, be prepared for that when 3.0 comes out.

Given this is very useful for pandas, I started doing this myself for pyjanitor and am documenting my findings here (feel free to update this issue description if you want to use it further for tracking compatibility with pandas 3.0).

Steps taken:

  1. Run the tests with the latest 2.3 and resolve deprecation warnings -> this is already done on main, and checking the last run I see two warnings:
    • tests/functions/test_convert_unix_date.py::test_convert_unix_date: The behavior of 'to_datetime' with 'unit' when parsing strings is deprecated. In a future version, strings will be parsed as datetime strings, matching the behavior without a 'unit'. To retain the old behavior, explicitly cast ints or floats to numeric type before calling to_datetime.
    • tests/functions/test_coalesce.py::test_coalesce_without_target: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
  2. Run the tests with 2.3 but with future options enabled.
    • I only did that for CoW, with PANDAS_COPY_ON_WRITE=warn pixi run -e py312 pytest tests/ --ignore tests/spark, and checked the few cases where it warned about setting on a view that will no longer propagate in the future, but they were all false positives (i.e. cases where we don't care about that)
    • For the string dtype, you can also directly test with the 3.0 rc now, see below
  3. Run the tests with the 3.0 RC. I started by doing that locally, and this yields a whole bunch of failures:
    • Upstream regression in boolean filtering, already fixed in the new RC1 (REGR: RangeIndex __getitem__ filtering with boolean extension array of length 1 pandas-dev/pandas#63391)
    • A recently added regex in clean_names ([ENH]/[TST] Issue #1257 fix #1538) is causing errors, because lookahead/lookbehind is apparently not supported by the regex engine used by pyarrow. Reported upstream: BUG: string replace results in invalid regular expression: invalid perl operator: (?<= pandas-dev/pandas#63385
    • Similarly, usage of \Z in regex gives problems, see the above linked issue
    • tests/functions/test_data_description.py::test_description_list et al: this is failing because the accessor relies on being cached, and that behaviour changed. I opened upstream issue REGR: registered accessors are no longer cached per DataFrame pandas-dev/pandas#63393 about this. Feedback would be very welcome
    • tests/functions/test_convert_unix_date.py::test_convert_unix_date is failing because of the deprecation mentioned in step 1. If you want to keep this working (although it is deprecated), you have to cast the column to integer before passing it to to_datetime (otherwise it requires a test update to remove the strings from the test data)
    • I get failures in tests/functions/test_conditional_join.py because of hypothesis creating string data that consists of invalid unicode, and that is now no longer supported in pandas (unless you explicitly specify dtype=object). Opened upstream issue to question if we should do that fallback automatically: Constructor with invalid unicode: automatically fall back to object dtype? pandas-dev/pandas#63396
    • tests/functions/test_concatenate_columns.py::test_concatenate_columns_null_values fails because df[column_names].astype(str).fillna("") no longer converts missing values to the string "nan" in the astype step, so they now actually get filled with "". This seems a "good" change, but requires a test update to make the test pass
    • tests/functions/test_complete.py::test_fill_value_scalar: there is an explicit astype(object) to construct the expected result, but the actual result now has string dtype, causing assert_frame_equal to fail (I assume just removing the astype should fix it). See the migration guide section
      • Similarly for the failures in tests/functions/test_move.py::test_move_source_target_seq/test_move_source_target_seq_after
    • tests/io/test_tidyxl.py::test_default_values_blank_cells_true: there is an explicit check for None, but the new string dtype now uses np.nan as missing value indicator (see the migration guide section)
    • tests/functions/test_to_datetime.py::test_to_datetime / tests/functions/test_truncate_datetime.py::test_truncate_datetime_containing_NaT: hardcoded check for the ns unit, but pandas 3.0 will now default to microseconds (needs a test update)
    • tests/functions/test_pivot_longer.py::test_pivot_sort_by_appearance and tests/functions/test_sort_column_value_order.py::test_sort_column_value_order: still have to figure out why sorting order might have changed (EDIT: this is now marked as xfailed after I fetched the latest changes from dev)
    • tests/functions/test_select_columns.py::test_select_groupby: the creation of the expected result with dataframe.select_dtypes("number").groupby(dataframe["a"]).sum() changed behaviour because of Copy-on-Write: select_dtypes now returns a shallow copy and not a hard copy, resulting in groupby seeing dataframe["a"] as a column of the calling dataframe, and therefore dropping it from the result (as it is already set as index)
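
Regarding the lookahead/lookbehind failure in clean_names: one possible way around it is to express the same substitution with capture groups, which do not need lookaround support. The sketch below uses Python's standard re module and a made-up camelCase pattern, not the actual regex from clean_names, so treat it only as an illustration of the rewrite.

```python
import re

name = "someColumnName"

# Lookaround version: matches the empty position between a lowercase
# and an uppercase letter. This is the kind of pattern that the regex
# engine used by pyarrow rejects.
with_lookaround = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name)

# Capture-group version: consume both letters and put them back around
# the underscore, so no lookbehind or lookahead is required.
with_groups = re.sub(r"([a-z])([A-Z])", r"\1_\2", name)

print(with_lookaround, with_groups)
```

The two forms agree here because a match always ends on an uppercase letter, which can never start the next match. Patterns whose matches can overlap would need more care.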
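
For the test_convert_unix_date failure, the cast mentioned above could look roughly like this. The column names are hypothetical, and pd.to_numeric is used here as one way of doing the cast; astype to an integer dtype would be another.

```python
import pandas as pd

# Hypothetical column holding Unix timestamps (in seconds) as strings.
df = pd.DataFrame({"epoch": ["0", "86400"]})

# Converting to a numeric dtype first avoids the deprecated path where
# to_datetime receives strings together with a unit.
seconds = pd.to_numeric(df["epoch"])
df["when"] = pd.to_datetime(seconds, unit="s")
print(df["when"].tolist())
```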
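
On the hypothesis failures in test_conditional_join: assuming the "invalid unicode" in question is text that cannot be encoded as UTF-8 (for example lone surrogate code points), a small standard-library check like the following could be used to filter generated strings. The helper name is made up for this sketch.

```python
def is_valid_unicode(text: str) -> bool:
    """Return True if text can be encoded as UTF-8 (no lone surrogates)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# The last sample is a lone surrogate, which UTF-8 cannot represent.
samples = ["abc", "caf\u00e9", "\ud800"]
valid = [s for s in samples if is_valid_unicode(s)]
print(valid)
```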
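
For the test_concatenate_columns_null_values case, a version-independent way to get empty strings for missing values might be to fill before casting, so the outcome no longer depends on what astype(str) does with missing values. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame with one missing value.
df = pd.DataFrame({"a": ["x", None], "b": ["y", "z"]})
column_names = ["a", "b"]

# Fill first, then cast: the missing value becomes "" regardless of
# whether astype(str) would have produced the literal string "nan".
as_text = df[column_names].fillna("").astype(str)
joined = as_text["a"] + "-" + as_text["b"]
print(joined.tolist())
```

Whether the library should produce "" or "nan" for missing values is a separate decision. This only shows how to make the result explicit.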
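
For the hardcoded ns checks in the datetime tests, one option is to convert both the actual and the expected values to an explicit unit before comparing, instead of asserting on the inferred one. This sketch assumes pandas 2.0 or later (for Series.dt.as_unit) and uses made-up dates.

```python
import pandas as pd

# Parse a few date strings, one of them missing (becomes NaT).
parsed = pd.to_datetime(pd.Series(["2021-01-01", None, "2021-06-15"]))

# The inferred resolution differs between pandas versions, so an
# assertion on "datetime64[ns]" is version dependent.
print(parsed.dtype)

# Converting both sides to one explicit unit keeps the comparison stable.
expected = pd.Series(
    [pd.Timestamp("2021-01-01"), pd.NaT, pd.Timestamp("2021-06-15")]
).dt.as_unit("s")
actual = parsed.dt.as_unit("s")
pd.testing.assert_series_equal(actual, expected)
```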
