refactor: decouple extension ABI from arrow-rs with Arrow C Data Interfaces#6337
refactor: decouple extension ABI from arrow-rs with Arrow C Data Interfaces#6337
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
| daft-ext = {path = "../../src/daft-ext"} | ||
| arrow-array = {version = "57.1.0", features = ["chrono-tz"]} | ||
| arrow-schema = "57.1.0" | ||
| arrow = {version = "58", features = ["ffi"]} |
There was a problem hiding this comment.
note the different major version now
…e types Remove arrow-rs dependency from daft-ext-abi by introducing owned Arrow C Data Interface types (ArrowSchema, ArrowArray, ArrowArrayStream) that are layout- compatible with the C ABI. This allows extensions to use any Arrow implementation (arrow-rs, arrow2, etc.) without coupling to a specific version. Key changes: - Add ArrowSchema, ArrowArray, ArrowArrayStream C types to daft-ext-abi - Add ArrowData (schema + array pair) for safe FFI data transfer - Move FFI conversion helpers into daft-ext-abi/src/ffi/arrow.rs - Update DaftScalarFunction trait to use ABI types instead of arrow-rs - Update DaftSession trait to use ABI types - Remove arrow-rs dependency from daft-ext-abi Cargo.toml - Add arrow.rs helpers in daft-ext-core for arrow-rs <-> ABI conversions - Update hello example to use new ABI types
5ce0830 to
38c3856
Compare
|
@greptileai please provide a critical review, we don't need a diagram. |
There was a problem hiding this comment.
I want to share my perspective, though I'm happy to disagree and commit if the team prefers this direction.
I'd generally favor ergonomics over ABI purity here. Someone writing their first Daft extension shouldn't need to understand C Data Interface types and macro-generated helpers just to get started.
The macro generates a bunch of import/export functions in every crate that calls it, and it forces extension authors to add daft_ext::define_arrow_helpers!() as a mandatory invocation before they can really do anything.
Compare the authoring experience:
This PR:
use daft_ext::prelude::*;
daft_ext::define_arrow_helpers!();
fn call(&self, args: &[ArrowData]) -> DaftResult<ArrowData> {
let input = import_array(unsafe { ArrowData::take_arg(args, 0) })?;
// ... work with input ...
Ok(export_array(&result)?)
}My draft (#6322):
use daft_ext::prelude::*;
fn call(&self, args: &[ArrayRef]) -> DaftResult<ArrayRef> {
let names = args[0].as_string::<i64>();
// ... just use arrow types directly ...
Ok(Arc::new(builder.finish()))
}In #6322, extension authors work with ArrayRef and Field directly via familiar arrow-rs types, no conversion ceremony, no macro.
I understand the argument about "true decoupling" from any Arrow implementation, but I think we should be practical about our actual users. If we intend for these to be primarily rust based, then it naturally makes sense to include arrow-rs as a dependency. The C/C++ extension use case is theoretical at this point (which would still be technically supported by using the arrow abi). Meanwhile, the ergonomic cost is IMO quite noticeable, every extension needs import/export wrapping, unsafe blocks for ArrowData::take_arg, and a magic macro invocation.
My bias would be towards simpler extension authoring experience over theoretical purity at the ABI layer.
But again, happy to go with consensus here.
|
@universalmind303 thanks, I don't disagree on ergonomics. We can always improve ergonomics on top of these APIs with higher-level helpers, primarily macro based. The important thing is to have proper decoupling where we actually interface. The pointer cast solution adds ergonomics to the lowest level interface, whereas I am intentionally choosing to go for purity on the lowest interface and we can build ergonomics up from there. Ideally we are in a future like, #[daft_scalar_function]
pub fn my_add(x: i64, y: i64) -> i64 {
x + y
}And all the arrow details disappear. This is how pgrx handles things, while hiding all the messy details behind the macro. I will continue with the proper decoupling and commit to improving ergonomics on top of this foundation. |
|
Ok, I have found some better ergonomics @universalmind303. I will add helpers for using raw pointers with our stable Arrow C types, and then feature-flag some arrow-rs versions. This removes the customer's boilerplate to get into our stable types, and if they are using arrow-rs then they just add the feature in their Cargo.toml. Updating this now. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6337 +/- ##
==========================================
- Coverage 74.95% 74.80% -0.15%
==========================================
Files 1021 1020 -1
Lines 135873 136113 +240
==========================================
- Hits 101839 101820 -19
- Misses 34034 34293 +259
🚀 New features to boost your workflow:
|
Remove arrow-rs dependency from daft-ext-abi by introducing owned Arrow C Data Interface types (ArrowSchema, ArrowArray, ArrowArrayStream) that are layout compatible with the C ABI. This allows extensions to use any Arrow implementation (arrow-rs, arrow2, etc.) without coupling to a specific version.
Changes Made
Related Issues