fix: Ensure time unit upcast for temporal is_in #22828


Draft: wants to merge 7 commits into main

Conversation

@mcrumiller (Contributor) commented May 19, 2025

Partially closes some issues related to #22824. Note that the actual issue looks like it's intended behavior, the reason being the comment here which says:

// can't check for more granular time_unit in less-granular time_unit data,
// or we'll cast away valid/necessary precision (eg: nanosecs to millisecs)

meaning that calling x("ns").is_in(y("us")) will still raise. I am not sure that I agree with this--I think that both sides should simply be upcast--but that can be another issue to resolve if others agree with me, or can be left alone if not. But note that this is the new behavior, which I find a little confusing but is nonetheless logically consistent:

import polars as pl
from datetime import datetime

s_milli = pl.Series([datetime(2025, 1, 1), datetime(2025, 1, 2)], dtype=pl.Datetime("ms"))
s_micro = pl.Series([datetime(2025, 1, 1), datetime(2025, 1, 2)], dtype=pl.Datetime("us"))
s_nano = pl.Series([datetime(2025, 1, 1), datetime(2025, 1, 2)], dtype=pl.Datetime("ns"))

s_milli.is_in([datetime(2025, 1, 1)])  # OK -- LHS less precise than RHS
s_micro.is_in([datetime(2025, 1, 1)])  # OK -- same time unit
s_nano.is_in([datetime(2025, 1, 1)])   # ERROR: LHS more precise than RHS

Regardless, when looking into this, I discovered two other issues that this PR addresses:

  1. The TimeUnit enum (see here) derives Ord and PartialOrd. The derive macro assigns ordering from top to bottom, smallest to largest, so our time unit ordering was completely reversed, e.g. TimeUnit::Nanoseconds < TimeUnit::Microseconds. Any comparison in the codebase relying on time unit ordering was therefore wrong. Fortunately I don't think there are many.

  2. When temporals with different time units were compared, they were never cast to a common time unit. Their physical values were therefore off by factors of 1000, producing incorrect is_in results.
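The mismatch in (2) is easy to see with plain epoch arithmetic (this is an illustration with ordinary integers, not Polars internals):

```python
from datetime import datetime, timezone

# The same instant stored under two time units has physical values that
# differ by a factor of 1000, so comparing raw physicals never matches.
instant = datetime(2025, 1, 2, tzinfo=timezone.utc)
phys_ms = int(instant.timestamp() * 1_000)       # physical in milliseconds
phys_us = int(instant.timestamp() * 1_000_000)   # physical in microseconds

assert phys_us != phys_ms             # naive comparison of physicals fails
assert phys_us == phys_ms * 1_000     # after upcasting ms -> us they agree
```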

@mcrumiller mcrumiller changed the title Ensure time unit upcast for temporal is_in fix: Ensure time unit upcast for temporal is_in May 19, 2025
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels May 19, 2025
(DataType::Date, dt_rhs @ DataType::Datetime(_, _)) => IsInTypeCoercionResult::SelfCast {
dtype: dt_rhs.clone(),
strict: false,
},
@mcrumiller (Contributor, Author) commented May 19, 2025

This allows for date.is_in(datetime) by casting the date to datetime.

@coastalwhite (Collaborator)

I don't think this should be allowed. is_in should not do supertyping; all of these weird rules have slipped in over time and I would rather not expand on them unless absolutely necessary.

@mcrumiller (Contributor, Author) commented May 20, 2025

@coastalwhite I admit that the current warning/error system feels off to me, as I would bet that the most common usage of is_in is when an explicit list of potential values is provided in a python list. If the integer dtypes don't quite match and require upcasting, then having strict dtype matching would preclude users from being able to call e.g. pl.col("uint8").is_in([1, 2, 3]) without explicitly casting the argument list, which is a syntactic nightmare.

The same issue applies here for python datetimes. Aside from the performance cost, I don't see the issue in a lossless upcast of the LHS, as it preserves all information--and if we're dealing with a case where an LHS value is invalidated by falling outside the new upcasted range, then it's impossible for the is_in call to evaluate to True on that element; it must be False.

Using max datetime values as reference:

s_milli = pl.Series([datetime(2025, 1, 1), datetime(9999, 1, 1)], dtype=pl.Datetime("ms"))
s_nano = pl.Series([datetime(2025, 1, 1)], dtype=pl.Datetime("ns"))
s_milli.is_in(s_nano)
# Error: 'is_in' cannot check for Nanoseconds precision values in Milliseconds Datetime data

Upcasting the LHS in this case would result in two elements: [2025-01-01, <out of bounds>]. We know that the out of bounds value is not in the RHS. This must always be the case. Surely we can keep a path for datetimes where we simply track the invalidated values after the upcast? This is easy by comparing the original null mask to the casted null mask.
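The tracking idea could be sketched roughly like this (plain Python with ints as physical values, None as null, and a set for the RHS; a hypothetical illustration of the logic, not the Polars kernel):

```python
# Upcast LHS physicals ms -> ns; values that would overflow i64 become
# null, and any newly-null element can simply be reported as False.
I64_MAX = 2**63 - 1

def upcast_ms_to_ns(v):
    out = v * 1_000_000
    return None if out > I64_MAX or out < -I64_MAX - 1 else out

lhs_ms = [1_735_689_600_000, 253_370_764_800_000]  # 2025-01-01, 9999-01-01
rhs_ns = {1_735_689_600_000_000_000}               # 2025-01-01 in ns

null_before = [v is None for v in lhs_ms]
casted = [None if v is None else upcast_ms_to_ns(v) for v in lhs_ms]
null_after = [v is None for v in casted]

# Elements nullified by the cast cannot possibly match: force False.
result = [
    False if (a and not b) else (v in rhs_ns)
    for v, a, b in zip(casted, null_after, null_before)
]
assert result == [True, False]
```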

@coastalwhite (Collaborator)

I admit that the current warning/error system feels off to me, as I would bet that the most common usage of is_in is when an explicit list of potential values is provided in a python list. If the integer dtypes don't quite match and require upcasting, then having strict dtype matching would preclude users from being able to call e.g. pl.col("uint8").is_in([1, 2, 3]) without explicitly casting the argument list, which is a syntactic nightmare.

I fully agree. From my perspective, this is more a problem of not having dynamic literals for lists.

The same issue applies here for python datetimes. Aside from the performance cost, I don't see the issue in a lossless upcast of the LHS, as it preserves all information--and if we're dealing with a case where the LHS value is invalidated due to being outside the new upcasted range, then it's impossible for the is_in call to evaluate to True on that element, it must be False.

Sadly, it is not lossless at the moment, ms to ns may quietly overflow.

s = pl.Series([datetime(2025, 1, 1), datetime(9999, 1, 1)])
print(s.cast(pl.Datetime("ns")))
shape: (2,)
Series: '' [datetime[ns]]
[
	2025-01-01 00:00:00
	1815-03-31 05:56:08.066277376
]

Surely we can keep a path for datetimes where we simply track the invalidated values after the upcast?

On a technical level, I am not sure how we would do this. We don't want to cast in the compute.

IsInTypeCoercionResult::SelfCast {
dtype: dt_rhs.clone(),
strict: false,
}
@mcrumiller (Contributor, Author)

This allows for e.g. dt("ms").is_in(dt("ns")) by upcasting dt("ms") to dt("ns").

@coastalwhite (Collaborator)

From my perspective, the signature of is_in is (T, Sequence[T]) -> Boolean. The datatype should try to conform to the datatype of the first parameter, and casting self should probably not be done like this.

From my perspective, it would make more sense if this was either disallowed or dt("ns") was cast to dt("ms").

IsInTypeCoercionResult::SelfCast {
dtype: dt_rhs.clone(),
strict: false,
}
@mcrumiller (Contributor, Author)

Same, but with duration.

@coastalwhite (Collaborator)

idem.

codecov bot commented May 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.03%. Comparing base (5be0a9b) to head (65be7df).
Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #22828      +/-   ##
==========================================
+ Coverage   80.99%   81.03%   +0.03%     
==========================================
  Files        1674     1674              
  Lines      236899   237118     +219     
  Branches     2787     2792       +5     
==========================================
+ Hits       191882   192151     +269     
+ Misses      44350    44297      -53     
- Partials      667      670       +3     


@mcrumiller mcrumiller marked this pull request as ready for review May 20, 2025 00:04
@coastalwhite (Collaborator) left a comment

This is quite a bad bug. However, I am not 100% sure this is the right solution; it might lead to other problems.


@mcrumiller (Contributor, Author)
@coastalwhite I've updated the rules a bit after our discussion. Let me know if this sounds reasonable. This only applies to temporals:

  • If the RHS is more precise than the LHS, we downcast the RHS with a warning that precision may be lost.
  • If the RHS is less precise than the LHS, we error; we cannot safely downcast anything.

I don't love this, but I think it's better than just erroring all the time. I think the best solution is to 1) fix temporal upcasting so that it doesn't silently convert to wrong values, and 2) upcast the appropriate side. This may be a bit more expensive but it is the most correct. In the meantime we will have to live with something less optimal.
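To see why the first rule needs a warning: downcasting the more precise side truncates sub-unit information, so two distinct instants can compare equal afterwards (plain ints standing in for physical values; a hypothetical illustration, not Polars code):

```python
# LHS stored in milliseconds, RHS in microseconds, representing two
# instants that differ by 123 microseconds.
lhs_ms = 1_735_689_600_000       # 2025-01-01 00:00:00 as ms physical
rhs_us = 1_735_689_600_000_123   # 2025-01-01 00:00:00.000123 as us physical

# Downcast RHS us -> ms: integer division truncates the extra precision.
rhs_down = rhs_us // 1_000
assert rhs_down == lhs_ms   # now reports a match, though the instants differ
```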

@mcrumiller (Contributor, Author)

Parking in draft until I can fix the lint errors tonight.

@mcrumiller mcrumiller marked this pull request as draft May 21, 2025 13:28
@mcrumiller mcrumiller marked this pull request as ready for review May 21, 2025 22:27
@mcrumiller (Contributor, Author)

@coastalwhite forgot to ping you--I think this is ready for another look.

@coastalwhite (Collaborator)

I think if #22901 is merged, this PR can remove the warnings 😄

@mcrumiller (Contributor, Author) commented May 23, 2025

I think if #22901 is merged, this PR can remove the warnings 😄

Hmm, but that doesn't affect the fact that the downcasts in that PR toss away information, e.g. ['2024-05-01'].is_in(['2024-05-01 10:13:22']) will return True because the RHS gets downcast.

However, it does allow for the proper solution, which (sometimes) requires an LHS upcast in the compute. We can then track the null changes and provide a completely correct answer--do you want me to work on that?

@mcrumiller mcrumiller marked this pull request as draft May 25, 2025 19:30
@coastalwhite (Collaborator)

I am sorry @mcrumiller if I have kept you a bit in the dark here. In the past, we have been very liberal with upcasts, which we then cannot revert. So I have grown a bit wary of adding more upcasts. To me, this is more a problem of the dynamic literals.

I think most problems here can be addressed with improving / fixing those.

@mcrumiller (Contributor, Author) commented May 26, 2025

@coastalwhite I have a method that is 100% correct using downcasts only, but the downcasts must happen in the compute, since it involves tracking which elements become nullified on downcast. Do you want me to try this out, or instead look into the dynamic literals? That sounds a bit tangential to this PR.

@mcrumiller (Contributor, Author)

So, my method does work, but with is_in_list it wouldn't be super easy to implement without a decent amount of bloat. For the record, suppose we have `d_ns = datetime(ns)` and `d_us = datetime(us)` and we want `d_ns.is_in(d_us)`:

  1. Record d_ns % 1_000 != 0. These values cannot be in d_us because they contain higher precision data than d_us can hold.
  2. Record d_ns.is_null(), then downcast d_ns to d_ns(us).
  3. Compare d_ns(us).is_null(). The new null values are those nullified by the cast, and therefore cannot be in d_us.
  4. The bitwise or of (1) and (3) is a mask indicating which values must be False.
  5. Return d_ns(us).is_in(d_us) & !mask.

...but in practice, with Lists and whatnot, this would basically duplicate all of the existing boilerplate code. If you have a good dynamic-literal solution that might be better, I'm all ears, but I do question how python datetime literals are supposed to interact with nanosecond data, or how is_in() for two series with different time units is supposed to be handled. The above seems like a correct solution that avoids upcasts.
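The five steps above can be sketched in plain Python (ints as physical values, None as null, a set for the RHS; an illustration of the logic, not the actual Polars kernels):

```python
def is_in_ns_us(d_ns, d_us_set):
    # 1. sub-microsecond remainders can never match microsecond data
    too_precise = [v is not None and v % 1_000 != 0 for v in d_ns]
    # 2. record the original null mask, then downcast ns -> us
    null_before = [v is None for v in d_ns]
    d_cast = [None if v is None else v // 1_000 for v in d_ns]
    # 3. values newly nullified by the cast cannot be in d_us either
    null_after = [v is None for v in d_cast]
    nullified = [a and not b for a, b in zip(null_after, null_before)]
    # 4. combine: any flagged element must be False
    mask = [p or n for p, n in zip(too_precise, nullified)]
    # 5. membership test on the casted values, with the mask applied
    return [
        None if v is None else ((v in d_us_set) and not m)
        for v, m in zip(d_cast, mask)
    ]

d_ns = [1_000_000, 1_000_001, None, 2_000_000]  # ns physicals
d_us = {1_000, 3_000}                           # us physicals
assert is_in_ns_us(d_ns, d_us) == [True, False, None, False]
```

Note that 1_000_001 ns casts to 1_000 us, which is in the RHS, but step 1 correctly forces the answer to False because the value carries sub-microsecond precision.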
