-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
fix: Ensure time unit upcast for temporal is_in
#22828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
is_in
is_in
(DataType::Date, dt_rhs @ DataType::Datetime(_, _)) => IsInTypeCoercionResult::SelfCast { | ||
dtype: dt_rhs.clone(), | ||
strict: false, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows for date.is_in(datetime)
by casting the date to datetime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be allowed. is_in
should not do supertyping, all of these weird rules have slipped in overtime and I would rather not expand on them unless absolutely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@coastalwhite I admit that the current warning/error system feels off to me, as I would bet that the most common usage of is_in
is when an explicit list of potential values is provided in a python list. If the integer dtypes don't quite match and require upcasting, then having strict dtype matching would preclude users from being able to call e.g. pl.col("uint8").is_in([1, 2, 3])
without explicitly casting the argument list, which a syntactic nightmare.
The same issue applies here for python datetimes. Aside from the performance cost, I don't see the issue in a lossless upcast of the LHS, as it preserves all information--and if we're dealing with a case where the LHS value is invalidated due to being outside the new upcasted range, then it's impossible for the is_in
call to evaluate to True on that element, it must be False.
Using max datetime values as reference:
s_milli = pl.Series([datetime(2025, 1, 1), datetime(9999, 1, 1)], dtype=pl.Datetime("ms"))
s_nano = pl.Series([datetime(2025, 1, 1)], dtype=pl.Datetime("ns"))
s_milli.is_in(s_nano)
# Error: 'is_in' cannot check for Nanoseconds precision values in Milliseconds Datetime data
Upcasting the LHS in this case would result in two elements: [2025-01-01, <out of bounds>]. We know that the out of bounds value is not in the RHS. This must always be the case. Surely we can keep a path for datetimes where we simply track the invalidated values after the upcast? This is easy by comparing the original null mask to the casted null mask.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit that the current warning/error system feels off to me, as I would bet that the most common usage of is_in is when an explicit list of potential values is provided in a python list. If the integer dtypes don't quite match and require upcasting, then having strict dtype matching would preclude users from being able to call e.g. pl.col("uint8").is_in([1, 2, 3]) without explicitly casting the argument list, which a syntactic nightmare.
I fully agree. From my perspective, this is more a problem of not have dynamic literals for lists.
The same issue applies here for python datetimes. Aside from the performance cost, I don't see the issue in a lossless upcast of the LHS, as it preserves all information--and if we're dealing with a case where the LHS value is invalidated due to being outside the new upcasted range, then it's impossible for the is_in call to evaluate to True on that element, it must be False.
Sadly, it is not lossless at the moment, ms
to ns
may quietly overflow.
s = pl.Series([datetime(2025, 1, 1), datetime(9999, 1, 1)])
print(s.cast(pl.Datetime("ns")))
shape: (2,)
Series: '' [datetime[ns]]
[
2025-01-01 00:00:00
1815-03-31 05:56:08.066277376
]
Surely we can keep a path for datetimes where we simply track the invalidated values after the upcast?
On a technical level, I am not sure how we would do this. We don't want to cast in the compute.
IsInTypeCoercionResult::SelfCast { | ||
dtype: dt_rhs.clone(), | ||
strict: false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows for e.g. dt("ms").is_in(dt("ns"))
by upcasting dt("ms")
to dt("ns")
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, the signature of is_in
is (T, Sequence[T]) -> Boolean
. The datatype should try to conform to datatype of the first parameter and casting self should probably not be done like this.
From my perspective, it would make more sense if this was either disallowed or dt("ns")
was cast to dt("ms")
.
IsInTypeCoercionResult::SelfCast { | ||
dtype: dt_rhs.clone(), | ||
strict: false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, but with duration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #22828 +/- ##
==========================================
+ Coverage 80.99% 81.03% +0.03%
==========================================
Files 1674 1674
Lines 236899 237118 +219
Branches 2787 2792 +5
==========================================
+ Hits 191882 192151 +269
+ Misses 44350 44297 -53
- Partials 667 670 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite a bad bug. Although, I am not 100% sure this is the solution and might lead to other problems.
(DataType::Date, dt_rhs @ DataType::Datetime(_, _)) => IsInTypeCoercionResult::SelfCast { | ||
dtype: dt_rhs.clone(), | ||
strict: false, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be allowed. is_in
should not do supertyping, all of these weird rules have slipped in overtime and I would rather not expand on them unless absolutely necessary.
IsInTypeCoercionResult::SelfCast { | ||
dtype: dt_rhs.clone(), | ||
strict: false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my perspective, the signature of is_in
is (T, Sequence[T]) -> Boolean
. The datatype should try to conform to datatype of the first parameter and casting self should probably not be done like this.
From my perspective, it would make more sense if this was either disallowed or dt("ns")
was cast to dt("ms")
.
IsInTypeCoercionResult::SelfCast { | ||
dtype: dt_rhs.clone(), | ||
strict: false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem.
@coastalwhite I've updated the rules a bit after our discussion. Let me know if this sounds reasonable. This only applies to temporals:
I don't love this but I think it's better than just erroring all the time. I think that the best solution is to 1) fix temporal upcasting so that it doesn't silently convert to wrong values, and 2) upcasting the appropriate side. This may be a bit more expensive but it is the most correct. In the meanwhile we will have to live with something less optimal. |
2491660
to
37e9cb4
Compare
Parking in draft until I can fix the lint errors tonight. |
55c707b
to
65be7df
Compare
@coastalwhite forgot to ping you--I think this is ready for another look. |
I think if #22901 is merged, this PR can remove the warnings 😄 |
Hmm, but that doesn't affect the fact that the downcasts in that PR toss away information, e.g. However, it does allow for the proper solution which (sometimes) requires LHS upcast in the compute. We can then track the null changes and provide a completely correct answer--do you want me to work on that? |
I am sorry @mcrumiller if I have kept you a bit in the dark here. In the past, we have been very liberal with upcasts, which we then cannot revert. So I have grown a bit weary of adding more upcasts. To me, this is more of a problem of the dynamic literals. I think most problems here can be addressed with improving / fixing those. |
@coastalwhite I have a method that is 100% correct entirely with downcasts only, but they must happen in the compute since it involves tracking which elements become nullified on downcast. Do you want me to try this out, or instead look into the dynamic literals? That sounds a bit tangential to this PR. |
So, my method does work, but with
...but in practice, with Lists and what not, this would basically duplicate all of the existing boilerplate code. If you have a good dynamic literal solution that sounds like it might be better I'm all ears, but I do question how python datetime literals are supposed to interact with nanosecond data, or how |
Partially closes some issues related to #22824. Note that the actual issue looks like it's intended behavior, the reason being the comment here which says:
meaning that calling
x("ns").is_in(y("us"))
will still raise. I am not sure that I agree with this--I think that both sides should simply be upcast--but that can be another issue to resolve if others agree with me, or can be left alone if not. But note that this is the new behavior, which I find a little confusing but is nonetheless logically consistent:Regardless, when looking into this, I discovered two other issues that this PR addresses:
The
TimeUnit
class (see here) derives theOrd
andPartialOrd
. The derive macro assigns ordering from top to bottom as smallest to largest, so our time unit ordering was completely reversed, e.g.TimeUnit::Nanoseconds < TimeUnit::Microseconds
. So any comparisons throughout the codebase using time unit comparisons were wrong. Fortunately I don't think there are many.When temporals were compared but have different time units, they were never cast to the same time unit. So their physicals were off by factors of 1000, resulting in incorrect
is_in
results.