Skip to content

feat: Add support for Expr.str.zfill #2598

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 61 commits into from
Jun 20, 2025
Merged

Conversation

williambdean
Copy link
Contributor

@williambdean williambdean commented May 23, 2025

This adds zfill for both Series and Expr

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@williambdean
Copy link
Contributor Author

Working through this:

check-api-reference....................................................................Failed
- hook id: check-api-reference
- exit code: 1

Series.str: not documented
{'zfill'}
Expr.str: not documented
{'zfill'}

@MarcoGorelli
Copy link
Member

ah yeah you need to add it to the expr_str.md / series_str.md pages

@williambdean
Copy link
Contributor Author

Thanks. On my way now.
Will add some tests when I log on later

@williambdean
Copy link
Contributor Author

williambdean commented May 23, 2025

Is this a bug in pandas?

import pandas as pd
import pyarrow as pa

pd.Series(["A", "AB", "ABC"], dtype=pd.ArrowDtype(pa.string())).str.zfill(3)

Trackback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/will/GitHub/various/narwhals/.venv/lib/python3.12/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/will/GitHub/various/narwhals/.venv/lib/python3.12/site-packages/pandas/core/strings/accessor.py", line 1818, in zfill
    result = self._data.array._str_map(f)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?

Works fine with pd.Series(["A", "AB", "ABC"], dtype="string[pyarrow]") though but that seems different dtype. Reference here

@williambdean williambdean marked this pull request as ready for review May 23, 2025 18:01
@FBruzzesi FBruzzesi added the enhancement New feature or request label May 23, 2025
@FBruzzesi FBruzzesi changed the title Feature: zfill feat: Add support for Expr.str.zfill May 23, 2025
Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @williambdean πŸ™πŸΌ

Is this a bug in pandas?

It looks like one πŸ˜…
We might be able to workaround by accessing the underlying pyarrow series - one example of that is the following:

if dtype_backend == "pyarrow":
import pyarrow.compute as pc
from narwhals._arrow.utils import native_to_narwhals_dtype
ca = native.array._pa_array
result_arr = cast("ChunkedArrayAny", pc.logb(ca, base))
nw_dtype = native_to_narwhals_dtype(result_arr.type, self._version)

it might still be worth to report it upstream

@williambdean
Copy link
Contributor Author

I've created a pandas issue here pandas-dev/pandas#61485

@williambdean
Copy link
Contributor Author

How do y'all do the CASE WHEN. Are there any examples? Or maybe you have a different way of implementing in mind

@FBruzzesi
Copy link
Member

FBruzzesi commented May 23, 2025

How do y'all do the CASE WHEN. Are there any examples?

@williambdean sure there are a lot of cases and it differs from backend to backend - #2549 has examples for lazy backends

Or maybe you have a different way of implementing in mind

Are you referring to the trailing dash comment?
If so, I guess it depends πŸ˜… pandas already does that out of the box - I have no clue for the other backends though

@williambdean
Copy link
Contributor Author

williambdean commented May 23, 2025

In the cases with no implementation of zfill (which seems like a few), I was using lpad. I was thinking on aiming for
CASE WHEN starts_with(col, "+") THEN ... WHEN starts_with("-") THEN ... ELSE END

what are your thoughts?

@FBruzzesi
Copy link
Member

In the cases with no implementation of zfill (which seems like a few), I was using lpad. I was thinking on aiming for CASE WHEN starts_with(col, "+") THEN ... WHEN starts_with("-") THEN ... ELSE END

what are your thoughts?

Yes it seems very reasonable to do so πŸ™ŒπŸΌ - the log PR have a similar case of 2-3 conditions needed to be checked

@williambdean
Copy link
Contributor Author

Awesome. Thanks for the lead. Will look into that example

@MarcoGorelli
Copy link
Member

MarcoGorelli commented May 24, 2025

IMHO it's OK to xfail a pandas test here, it looks like the rest is mostly ready? we can always make a follow-up to work around the pandas bug if necessary

@williambdean
Copy link
Contributor Author

Sounds good. The case that I will finish with is the behavior with leading + or - for the implementations that are using lpad

@williambdean
Copy link
Contributor Author

williambdean commented May 24, 2025

I've added a test case for the leading +. zfill from python does "+1".zfill(3) = "+01". But it seems like the polars eager does something different

These are the current failures with the latest commits:


===================================================== short test summary info =====================================================
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pyarrow] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pyarrow] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pandas[pyarrow]] - AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[polars[eager]] - AssertionError: Mismatch at index 1: 0+1 != +01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[polars[eager]] - AssertionError: Mismatch at index 1: 0+1 != +01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pandas[pyarrow]] - AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[ibis] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[sqlframe] - AssertionError: Mismatch at index 0: 0-1 != -01

Which behavior is suppose to match?

Seeing some differences

[ins] In [4]: pd.Series(["+1", "++1"]).str.zfill(4)
Out[4]:
0    +001
1    +0+1
dtype: object

[ins] In [5]: import polars as pl

[ins] In [6]: pl.Series(["+1", "++1"]).str.zfill(4)
Out[6]:
shape: (2,)
Series: '' [str]
[
        "00+1"
        "0++1"
]

[ins] In [7]: "+1".zfill(4)
Out[7]: '+001'

[ins] In [8]: "++1".zfill(4)
Out[8]: '+0+1'

@MarcoGorelli
Copy link
Member

Probably good to match polars

@williambdean
Copy link
Contributor Author

well pandas and polars don't match πŸ˜†

@MarcoGorelli
Copy link
Member

I don't really understand the pandas output, but we can standardise on what polars does and note the pandas difference in the docstring? This feels like really edge case behavior anyway

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @williambdean πŸ‘‹πŸΌ thanks for your patience!

CI failures aside, I find the code in _spark_like to be the easiest to read, the most modular, variable names are self clear and the one with less repetition.

I think all other backends would benefit from a similar style of implementation πŸ™ˆ

Anyway, this is definitly really close to the finish line and get into the next Monday releaseπŸ‘Œ

@williambdean
Copy link
Contributor Author

Thanks for that feedback. I've gone ahead and address that. Let me know your thoughts. The _arrow is a bit tough and has me running into "too many local variables"

@williambdean
Copy link
Contributor Author

Need to digest the typing failure still. Any tips would be appreciated!

@FBruzzesi
Copy link
Member

FBruzzesi commented Jun 19, 2025

Hey @williambdean I did a commit (cb0c30e) to address both the following:

The _arrow is a bit tough and has me running into "too many local variables"

I restructured a bit and renamed the variables

Need to digest the typing failure still. Any tips would be appreciated!

The issue was that pl.when generates an expression rather than a series. Since the test was calling that in a select context, it was not failing. The result is of the operation is:

(Pdb) native_result
<Expr ['.when(Series[a]).then(Series[a…'] at 0x7F599DD65430>

I moved to pl.Series.zip_with to make sure a Series is maintained in the output and changed the series test slightly to check that

@williambdean
Copy link
Contributor Author

Thanks for those edits @FBruzzesi

@FBruzzesi
Copy link
Member

FBruzzesi commented Jun 19, 2025

@MarcoGorelli and/or @dangotbanned would you mind giving it another pass? To me it looks ready to be merged πŸ‘ŒπŸš€

@dangotbanned
Copy link
Member

@MarcoGorelli and/or @dangotbanned would you mind giving it another pass? To me it looks ready to be merged πŸ‘ŒπŸš€

Sure, I'll take a look in a bit @FBruzzesi

@dangotbanned dangotbanned self-requested a review June 19, 2025 12:50
Copy link
Member

@dangotbanned dangotbanned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all, (especially @williambdean!)

I've got a few suggestions and an idea for PolarsSeries.
I think we're almost there πŸŽ‰

Comment on lines 693 to 705
if self._compliant_series._backend_version <= (1, 30, 0):
starts_with_plus = native_series.str.starts_with("+")
length = native_series.str.len_chars()
less_than_width = length < width

other = (
native_series.str.slice(1, length)
.str.zfill(width - 1)
.str.pad_start(width, "+")
)
native_result = native_result.zip_with(
mask=~(starts_with_plus & less_than_width), other=other
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might we be able to do this a bit lazier?

There's a lot of pl.Series methods that dispatch to pl.Expr on pl.DataFrame | pl.LazyFrame under the hood.
I feel like we could do something like that for performance, but also re-use our _backend_version branching logic from PolarsExpr.

Currently we evaluate at least 6 intermediate pl.Series, which to me seems like too much for an edge case πŸ€”

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we'll see if this holds up in CI 🀞

perf: Reuse PolarsExpr impl in PolarsSeries

@williambdean
Copy link
Contributor Author

I've merged in those suggestions. Thanks @dangotbanned

Copy link
Member

@dangotbanned dangotbanned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @williambdean πŸ₯³

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really solid work, thanks @williambdean for sticking to this, and others for reviewing!

just a minor comment, then we'll get this into the next release

Comment on lines 12 to 19
def skip_pandas_pyarrow(constructor: Constructor | ConstructorEager) -> None:
name: str = constructor.__name__
if name in {"pandas_pyarrow_constructor", "modin_pyarrow_constructor"}:
reason = (
"pandas with pyarrow backend doesn't support str.zfill, see "
"https://github.com/pandas-dev/pandas/issues/61485"
)
raise pytest.skip(reason=reason)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually use skip for things which don't and will never pass

here, however, we can expect that pandas[pyarrow] will eventually support zfill, so it think we should use request.applymarker(pytest.mark.xfail)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli I keep forgetting to suggest we add more tools like this in tests.utils.py

But trimmed down to just the constructor test, like

narwhals/tests/utils.py

Lines 154 to 156 in 99222f6

def is_pyarrow_windows_no_tzdata(constructor: Constructor, /) -> bool:
"""Skip test on Windows when the tz database is not configured."""
return "pyarrow" in str(constructor) and is_windows() and not windows_has_tzdata()

Copy link
Contributor Author

@williambdean williambdean Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add in a similar fashion to is_pyarrow_windows_no_tzdata

EDIT: updated

@FBruzzesi
Copy link
Member

Thanks again @williambdean πŸ™ŒπŸΌ

UpVoteDownVoteGIF (2)

@FBruzzesi FBruzzesi merged commit 08b2896 into narwhals-dev:main Jun 20, 2025
34 checks passed
@williambdean williambdean deleted the zfill branch June 20, 2025 11:50
@williambdean
Copy link
Contributor Author

Thank you all for the assistance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pyspark Issue is related to pyspark backend pyspark-connect
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enh]: str zfill
4 participants