fix: Ensure time unit upcast for temporal `is_in` #22828

mcrumiller · 2025-05-19T19:48:17Z

Partially closes some issues related to #22824. Note that the actual issue looks like it's intended behavior, the reason being the comment here which says:

// can't check for more granular time_unit in less-granular time_unit data,
// or we'll cast away valid/necessary precision (eg: nanosecs to millisecs)

meaning that calling x("ns").is_in(y("us")) will still raise. I am not sure that I agree with this--I think that both sides should simply be upcast--but that can be another issue to resolve if others agree with me, or can be left alone if not. But note that this is the new behavior, which I find a little confusing but is nonetheless logically consistent:

import polars as pl
from datetime import datetime

s_milli = pl.Series([datetime(2025, 1, 1), datetime(2025, 1, 2)], dtype=pl.Datetime("ms"))
s_micro = pl.Series([datetime(2025, 1, 1), datetime(2025, 1, 2)], dtype=pl.Datetime("us"))
s_nano = pl.Series([datetime(2025, 1, 1), datetime(2025, 1, 2)], dtype=pl.Datetime("ns"))

s_milli.is_in([datetime(2025, 1, 1)])  # OK -- LHS less precise than right
s_micro.is_in([datetime(2025, 1, 1)])  # OK -- same time unit
s_nano.is_in([datetime(2025, 1, 1)])   # ERROR: LHS more precise than right

Regardless, when looking into this, I discovered two other issues that this PR addresses:

The TimeUnit class (see here) derives the Ord and PartialOrd. The derive macro assigns ordering from top to bottom as smallest to largest, so our time unit ordering was completely reversed, e.g. TimeUnit::Nanoseconds < TimeUnit::Microseconds. So any comparisons throughout the codebase using time unit comparisons were wrong. Fortunately I don't think there are many.
When temporals were compared but have different time units, they were never cast to the same time unit. So their physicals were off by factors of 1000, resulting in incorrect is_in results.

mcrumiller · 2025-05-19T19:54:57Z

crates/polars-plan/src/plans/conversion/type_coercion/is_in.rs

+        (DataType::Date, dt_rhs @ DataType::Datetime(_, _)) => IsInTypeCoercionResult::SelfCast {
+            dtype: dt_rhs.clone(),
+            strict: false,
+        },


This allows for date.is_in(datetime) by casting the date to datetime.

I don't think this should be allowed. is_in should not do supertyping, all of these weird rules have slipped in overtime and I would rather not expand on them unless absolutely necessary.

@coastalwhite I admit that the current warning/error system feels off to me, as I would bet that the most common usage of is_in is when an explicit list of potential values is provided in a python list. If the integer dtypes don't quite match and require upcasting, then having strict dtype matching would preclude users from being able to call e.g. pl.col("uint8").is_in([1, 2, 3]) without explicitly casting the argument list, which a syntactic nightmare.

The same issue applies here for python datetimes. Aside from the performance cost, I don't see the issue in a lossless upcast of the LHS, as it preserves all information--and if we're dealing with a case where the LHS value is invalidated due to being outside the new upcasted range, then it's impossible for the is_in call to evaluate to True on that element, it must be False.

Using max datetime values as reference:

s_milli = pl.Series([datetime(2025, 1, 1), datetime(9999, 1, 1)], dtype=pl.Datetime("ms")) s_nano = pl.Series([datetime(2025, 1, 1)], dtype=pl.Datetime("ns")) s_milli.is_in(s_nano) # Error: 'is_in' cannot check for Nanoseconds precision values in Milliseconds Datetime data

Upcasting the LHS in this case would result in two elements: [2025-01-01, <out of bounds>]. We know that the out of bounds value is not in the RHS. This must always be the case. Surely we can keep a path for datetimes where we simply track the invalidated values after the upcast? This is easy by comparing the original null mask to the casted null mask.

I admit that the current warning/error system feels off to me, as I would bet that the most common usage of is_in is when an explicit list of potential values is provided in a python list. If the integer dtypes don't quite match and require upcasting, then having strict dtype matching would preclude users from being able to call e.g. pl.col("uint8").is_in([1, 2, 3]) without explicitly casting the argument list, which a syntactic nightmare.

I fully agree. From my perspective, this is more a problem of not have dynamic literals for lists.

The same issue applies here for python datetimes. Aside from the performance cost, I don't see the issue in a lossless upcast of the LHS, as it preserves all information--and if we're dealing with a case where the LHS value is invalidated due to being outside the new upcasted range, then it's impossible for the is_in call to evaluate to True on that element, it must be False.

Sadly, it is not lossless at the moment, ms to ns may quietly overflow.

s = pl.Series([datetime(2025, 1, 1), datetime(9999, 1, 1)]) print(s.cast(pl.Datetime("ns")))

shape: (2,) Series: '' [datetime[ns]] [ 2025-01-01 00:00:00 1815-03-31 05:56:08.066277376 ]

Surely we can keep a path for datetimes where we simply track the invalidated values after the upcast?

On a technical level, I am not sure how we would do this. We don't want to cast in the compute.

mcrumiller · 2025-05-19T19:56:43Z

crates/polars-plan/src/plans/conversion/type_coercion/is_in.rs

+                IsInTypeCoercionResult::SelfCast {
+                    dtype: dt_rhs.clone(),
+                    strict: false,
+                }


This allows for e.g. dt("ms").is_in(dt("ns")) by upcasting dt("ms") to dt("ns").

From my perspective, the signature of is_in is (T, Sequence[T]) -> Boolean. The datatype should try to conform to datatype of the first parameter and casting self should probably not be done like this.

From my perspective, it would make more sense if this was either disallowed or dt("ns") was cast to dt("ms").

mcrumiller · 2025-05-19T19:57:03Z

crates/polars-plan/src/plans/conversion/type_coercion/is_in.rs

+                IsInTypeCoercionResult::SelfCast {
+                    dtype: dt_rhs.clone(),
+                    strict: false,
+                }


Same, but with duration.

codecov · 2025-05-19T20:55:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.03%. Comparing base (5be0a9b) to head (65be7df).
Report is 13 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #22828      +/-   ##
==========================================
+ Coverage   80.99%   81.03%   +0.03%     
==========================================
  Files        1674     1674              
  Lines      236899   237118     +219     
  Branches     2787     2792       +5     
==========================================
+ Hits       191882   192151     +269     
+ Misses      44350    44297      -53     
- Partials      667      670       +3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coastalwhite

This is quite a bad bug. Although, I am not 100% sure this is the solution and might lead to other problems.

coastalwhite · 2025-05-20T07:41:04Z

crates/polars-plan/src/plans/conversion/type_coercion/is_in.rs

+        (DataType::Date, dt_rhs @ DataType::Datetime(_, _)) => IsInTypeCoercionResult::SelfCast {
+            dtype: dt_rhs.clone(),
+            strict: false,
+        },


I don't think this should be allowed. is_in should not do supertyping, all of these weird rules have slipped in overtime and I would rather not expand on them unless absolutely necessary.

coastalwhite · 2025-05-20T07:44:52Z

crates/polars-plan/src/plans/conversion/type_coercion/is_in.rs

+                IsInTypeCoercionResult::SelfCast {
+                    dtype: dt_rhs.clone(),
+                    strict: false,
+                }


From my perspective, the signature of is_in is (T, Sequence[T]) -> Boolean. The datatype should try to conform to datatype of the first parameter and casting self should probably not be done like this.

From my perspective, it would make more sense if this was either disallowed or dt("ns") was cast to dt("ms").

coastalwhite · 2025-05-20T07:44:59Z

crates/polars-plan/src/plans/conversion/type_coercion/is_in.rs

+                IsInTypeCoercionResult::SelfCast {
+                    dtype: dt_rhs.clone(),
+                    strict: false,
+                }


mcrumiller · 2025-05-21T12:22:30Z

@coastalwhite I've updated the rules a bit after our discussion. Let me know if this sounds reasonable. This only applies to temporals:

If the RHS is more precise than the right, we downcast the RHS with a warning that precision may be lost.
If the RHS is less precise than the right, we error. We cannot safely downcast anything.

I don't love this but I think it's better than just erroring all the time. I think that the best solution is to 1) fix temporal upcasting so that it doesn't silently convert to wrong values, and 2) upcasting the appropriate side. This may be a bit more expensive but it is the most correct. In the meanwhile we will have to live with something less optimal.

mcrumiller · 2025-05-21T13:27:54Z

Parking in draft until I can fix the lint errors tonight.

mcrumiller · 2025-05-22T15:36:55Z

@coastalwhite forgot to ping you--I think this is ready for another look.

coastalwhite · 2025-05-23T09:23:12Z

I think if #22901 is merged, this PR can remove the warnings 😄

mcrumiller · 2025-05-23T12:48:29Z

I think if #22901 is merged, this PR can remove the warnings 😄

Hmm, but that doesn't affect the fact that the downcasts in that PR toss away information, e.g. ['2024-05-0-1'].is_in(['2024-05-01 10:13:22']) will return True becuase the RHS gets downcast.

However, it does allow for the proper solution which (sometimes) requires LHS upcast in the compute. We can then track the null changes and provide a completely correct answer--do you want me to work on that?

coastalwhite · 2025-05-26T09:23:37Z

I am sorry @mcrumiller if I have kept you a bit in the dark here. In the past, we have been very liberal with upcasts, which we then cannot revert. So I have grown a bit weary of adding more upcasts. To me, this is more of a problem of the dynamic literals.

I think most problems here can be addressed with improving / fixing those.

mcrumiller · 2025-05-26T13:21:01Z

@coastalwhite I have a method that is 100% correct entirely with downcasts only, but they must happen in the compute since it involves tracking which elements become nullified on downcast. Do you want me to try this out, or instead look into the dynamic literals? That sounds a bit tangential to this PR.

mcrumiller · 2025-05-26T23:36:10Z

So, my method does work, but with is_in_list it wouldn't be super easy to implement without a decent amount of bloat. For the record, suppose we have d_ns = datetime(ns)andd_us = datetime(us)and we wantd_ns.is_in(d_us)`:

Record d_ns % 1_000 != 0. These values cannot be in d_us because they contain higher precision data than d_us can hold.
Record d_ns.is_null(), then downcast d_ns to d_ns(us).
Compare d_ns(us).is_null(). The new null values are values were nullified by the cast, and therefore cannot be in d_us.
The bitwise or of (1) and (3) are a mask indicating which values must be False.
Return d_ns(us).is_in(d_us) & !mask.

...but in practice, with Lists and what not, this would basically duplicate all of the existing boilerplate code. If you have a good dynamic literal solution that sounds like it might be better I'm all ears, but I do question how python datetime literals are supposed to interact with nanosecond data, or how is_in() for two different time unit series is supposed to be handled. The above seems like a good correct solution that avoids upcasts.

Allow is_in for more temporals

4958a77

github-actions bot added the title needs formatting label May 19, 2025

mcrumiller changed the title ~~Ensure time unit upcast for temporal is_in~~ fix: Ensure time unit upcast for temporal is_in May 19, 2025

github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels May 19, 2025

Move comment

ad53456

mcrumiller commented May 19, 2025

View reviewed changes

Adjust test

2ef44d6

mcrumiller marked this pull request as ready for review May 20, 2025 00:04

mcrumiller requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners May 20, 2025 00:04

coastalwhite requested changes May 20, 2025

View reviewed changes

Update casting rules

81ade3d

lint

37e9cb4

mcrumiller force-pushed the is-in-temporal branch from 2491660 to 37e9cb4 Compare May 21, 2025 12:28

mcrumiller marked this pull request as draft May 21, 2025 13:28

Fix zip

1dd6d0b

mcrumiller force-pushed the is-in-temporal branch from 55c707b to 65be7df Compare May 21, 2025 22:11

Remove warning

65be7df

mcrumiller marked this pull request as ready for review May 21, 2025 22:27

mcrumiller marked this pull request as draft May 25, 2025 19:30

Uh oh!

fix: Ensure time unit upcast for temporal is_in #22828

Are you sure you want to change the base?

fix: Ensure time unit upcast for temporal is_in #22828

Uh oh!

Conversation

mcrumiller commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mcrumiller May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcrumiller May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coastalwhite left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mcrumiller commented May 21, 2025

Uh oh!

mcrumiller commented May 21, 2025

Uh oh!

mcrumiller commented May 22, 2025

Uh oh!

coastalwhite commented May 23, 2025

Uh oh!

mcrumiller commented May 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coastalwhite commented May 26, 2025

Uh oh!

mcrumiller commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mcrumiller commented May 26, 2025

Uh oh!

Uh oh!

fix: Ensure time unit upcast for temporal `is_in` #22828

fix: Ensure time unit upcast for temporal `is_in` #22828

mcrumiller commented May 19, 2025 •

edited

Loading

mcrumiller May 19, 2025 •

edited

Loading

mcrumiller May 20, 2025 •

edited

Loading

codecov bot commented May 19, 2025 •

edited

Loading

mcrumiller commented May 23, 2025 •

edited

Loading

mcrumiller commented May 26, 2025 •

edited

Loading