-
Notifications
You must be signed in to change notification settings - Fork 150
feat: Add support for Expr.str.zfill
#2598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Working through this:
|
ah yeah you need to add it to the |
Thanks. On my way now. |
Is this a bug in pandas? import pandas as pd
import pyarrow as pa
pd.Series(["A", "AB", "ABC"], dtype=pd.ArrowDtype(pa.string())).str.zfill(3) Trackback:
Works fine with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @williambdean ππΌ
Is this a bug in pandas?
It looks like one π
We might be able to workaround by accessing the underlying pyarrow series - one example of that is the following:
narwhals/narwhals/_pandas_like/series.py
Lines 1026 to 1033 in 35ec9fe
if dtype_backend == "pyarrow": | |
import pyarrow.compute as pc | |
from narwhals._arrow.utils import native_to_narwhals_dtype | |
ca = native.array._pa_array | |
result_arr = cast("ChunkedArrayAny", pc.logb(ca, base)) | |
nw_dtype = native_to_narwhals_dtype(result_arr.type, self._version) |
it might still be worth to report it upstream
I've created a pandas issue here pandas-dev/pandas#61485 |
How do y'all do the CASE WHEN. Are there any examples? Or maybe you have a different way of implementing in mind |
@williambdean sure there are a lot of cases and it differs from backend to backend - #2549 has examples for lazy backends
Are you referring to the trailing dash comment? |
In the cases with no implementation of zfill (which seems like a few), I was using lpad. I was thinking on aiming for what are your thoughts? |
Yes it seems very reasonable to do so ππΌ - the log PR have a similar case of 2-3 conditions needed to be checked |
Awesome. Thanks for the lead. Will look into that example |
IMHO it's OK to xfail a pandas test here, it looks like the rest is mostly ready? we can always make a follow-up to work around the pandas bug if necessary |
Sounds good. The case that I will finish with is the behavior with leading + or - for the implementations that are using lpad |
I've added a test case for the leading +. zfill from python does "+1".zfill(3) = "+01". But it seems like the polars eager does something different These are the current failures with the latest commits:
Which behavior is suppose to match? Seeing some differences
|
Probably good to match polars |
well pandas and polars don't match π |
I don't really understand the pandas output, but we can standardise on what polars does and note the pandas difference in the docstring? This feels like really edge case behavior anyway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @williambdean ππΌ thanks for your patience!
CI failures aside, I find the code in _spark_like
to be the easiest to read, the most modular, variable names are self clear and the one with less repetition.
I think all other backends would benefit from a similar style of implementation π
Anyway, this is definitly really close to the finish line and get into the next Monday releaseπ
Thanks for that feedback. I've gone ahead and address that. Let me know your thoughts. The _arrow is a bit tough and has me running into "too many local variables" |
Need to digest the typing failure still. Any tips would be appreciated! |
Hey @williambdean I did a commit (cb0c30e) to address both the following:
I restructured a bit and renamed the variables
The issue was that (Pdb) native_result
<Expr ['.when(Series[a]).then(Series[aβ¦'] at 0x7F599DD65430> I moved to |
Thanks for those edits @FBruzzesi |
@MarcoGorelli and/or @dangotbanned would you mind giving it another pass? To me it looks ready to be merged ππ |
Sure, I'll take a look in a bit @FBruzzesi |
Would've been caught by narwhals-dev#2500
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all, (especially @williambdean!)
I've got a few suggestions and an idea for PolarsSeries
.
I think we're almost there π
narwhals/_polars/series.py
Outdated
if self._compliant_series._backend_version <= (1, 30, 0): | ||
starts_with_plus = native_series.str.starts_with("+") | ||
length = native_series.str.len_chars() | ||
less_than_width = length < width | ||
|
||
other = ( | ||
native_series.str.slice(1, length) | ||
.str.zfill(width - 1) | ||
.str.pad_start(width, "+") | ||
) | ||
native_result = native_result.zip_with( | ||
mask=~(starts_with_plus & less_than_width), other=other | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might we be able to do this a bit lazier?
There's a lot of pl.Series
methods that dispatch to pl.Expr
on pl.DataFrame | pl.LazyFrame
under the hood.
I feel like we could do something like that for performance, but also re-use our _backend_version
branching logic from PolarsExpr
.
Currently we evaluate at least 6 intermediate pl.Series
, which to me seems like too much for an edge case π€
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we'll see if this holds up in CI π€
Co-authored-by: Dan Redding <[email protected]>
Co-authored-by: Dan Redding <[email protected]>
Co-authored-by: Dan Redding <[email protected]>
I've merged in those suggestions. Thanks @dangotbanned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @williambdean π₯³
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really solid work, thanks @williambdean for sticking to this, and others for reviewing!
just a minor comment, then we'll get this into the next release
def skip_pandas_pyarrow(constructor: Constructor | ConstructorEager) -> None: | ||
name: str = constructor.__name__ | ||
if name in {"pandas_pyarrow_constructor", "modin_pyarrow_constructor"}: | ||
reason = ( | ||
"pandas with pyarrow backend doesn't support str.zfill, see " | ||
"https://github.com/pandas-dev/pandas/issues/61485" | ||
) | ||
raise pytest.skip(reason=reason) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we usually use skip
for things which don't and will never pass
here, however, we can expect that pandas[pyarrow] will eventually support zfill, so it think we should use request.applymarker(pytest.mark.xfail)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli I keep forgetting to suggest we add more tools like this in tests.utils.py
But trimmed down to just the constructor test, like
Lines 154 to 156 in 99222f6
def is_pyarrow_windows_no_tzdata(constructor: Constructor, /) -> bool: | |
"""Skip test on Windows when the tz database is not configured.""" | |
return "pyarrow" in str(constructor) and is_windows() and not windows_has_tzdata() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add in a similar fashion to is_pyarrow_windows_no_tzdata
EDIT: updated
Thanks again @williambdean ππΌ |
Thank you all for the assistance! |
This adds
zfill
for both Series and ExprWhat type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below