I recently discovered that both polars and datafusion-python do not push down timestamp predicates correctly for pyarrow datasets. This is problematic: filtering on a timestamp is a very, very common use case. I suspect both libraries can fix this on their side, but for datafusion the round trip from datafusion (inside delta-rs) -> pyarrow -> back to datafusion (in datafusion-python) seems like unnecessary overhead and is evidently somewhat brittle. Because the failure is silent, it took me weeks to discover; I only found it when I noticed that going through deltalake was much slower than reading the raw parquet data directly for certain queries.
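
To make the failure mode concrete, here is a minimal sketch of the pattern I'm describing (the table path and the `event_time` column are made up): the same timestamp filter applied to the pyarrow dataset directly versus through an engine that wraps it.

```python
from datetime import datetime, timezone

import polars as pl
import pyarrow.dataset as ds
from deltalake import DeltaTable

# Hypothetical table location and column name, just for illustration.
dataset = DeltaTable("s3://bucket/events").to_pyarrow_dataset()
cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)

# Filtering the pyarrow dataset directly: the timestamp predicate is
# applied at scan time.
direct = dataset.to_table(filter=ds.field("event_time") >= cutoff)

# The same filter routed through an engine that wraps the pyarrow
# dataset (polars here; datafusion-python behaves similarly): the
# predicate is reportedly not pushed down, so the whole table is
# scanned even though the result is identical.
via_engine = (
    pl.scan_pyarrow_dataset(dataset)
    .filter(pl.col("event_time") >= cutoff)
    .collect()
)
```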
Would this crate be opposed to offering a direct integration with datafusion, since datafusion is already used internally? Ideally this would be some sort of extra or plugin, but sadly that's not really possible with the current state of PyO3 extensions.
Some ideas:
- datafusion-python allows implementing a PythonTableProvider. While this would still require going from Rust -> Python -> Rust, it would at least remove the brittle conversion to pyarrow and back, and would be more akin to serializing to Python and deserializing back to Rust (maybe even using serde or https://github.com/davidhewitt/pythonize?). Basically this crate would have to provide a Python wrapper for the existing TableProvider that could then be passed into datafusion-python.
- This crate depends on datafusion-python and re-exports the entire Python API.
- We add datafusion-python as an optional cargo dependency and re-export the API under `deltalake-datafusion`. You wouldn't be able to mix and match this with plain `deltalake`, e.g. by passing a `deltalake.DeltaTable` between them; you'd have to construct a `deltalake_datafusion.DeltaTable`, although the latter could maybe extract information from the former or call it via the Python APIs. See the sketch below for what this could look like.
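
For that last option, a rough sketch of the user-facing split (every name below is hypothetical; nothing like this exists yet):

```python
# Purely illustrative: neither the deltalake_datafusion package nor any
# of these re-exported names exist today; this just sketches the
# "separate re-exported API" idea.
import deltalake
import deltalake_datafusion  # hypothetical extra built on datafusion-python

# Plain deltalake table, unchanged.
plain = deltalake.DeltaTable("s3://bucket/events")

# Separate table type that plugs straight into the re-exported
# datafusion API, possibly constructed from the plain table's metadata.
table = deltalake_datafusion.DeltaTable("s3://bucket/events")

ctx = deltalake_datafusion.SessionContext()  # re-export of datafusion-python's context
ctx.register_table("events", table)
df = ctx.sql("SELECT count(*) FROM events WHERE event_time >= '2023-01-01'")
```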