feat: adds support for daft native extensions via stable C ABI#6301
Greptile Summary
This PR adds support for native Daft extensions via a stable C ABI, enabling Rust-based scalar functions to be loaded dynamically and integrated into Daft's expression system. Key changes:
Issues found:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User as Python User
participant Session as daft.Session
participant Loader as daft_ext_internal
participant Module as Extension .so
participant Function as Native Function
User->>Session: load_extension(hello)
Session->>Session: ctypes.CDLL (RTLD_GLOBAL)
Session->>Loader: load_and_init_extension(path)
Loader->>Loader: dlopen & cache module
Loader->>Module: call daft_module_magic()
Module-->>Loader: FFI_Module struct
Loader->>Loader: validate ABI version
Loader->>Module: call init(session_ctx)
Module->>Module: install extension
Module->>Loader: define_function(FFI_ScalarFunction)
Loader->>Session: attach_function(factory)
Session-->>User: extension loaded
User->>Session: get_function("greet", args)
Session->>Function: get_return_field(args, schema)
Function-->>Session: Field (via FFI)
Session->>Function: call(args) [Arrow C Data Interface]
Function-->>Session: ArrayRef (via FFI)
Session-->>User: Expression result
Last reviewed commit: 564548a
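The "validate ABI version" and "call init(session_ctx)" steps in the diagram can be sketched roughly as follows. This is a minimal illustration, not the PR's loader: `FfiModule`, `SessionCtx`, `HOST_ABI_VERSION`, `load_module`, and `example_init` are hypothetical stand-ins, and the real `FFI_Module` layout may differ.

```rust
/// Hypothetical ABI version the host was built against.
pub const HOST_ABI_VERSION: u32 = 1;

/// Hypothetical stand-in for the session context handed to an extension.
#[repr(C)]
pub struct SessionCtx {
    pub functions_defined: u32,
}

/// Hypothetical stand-in for the module descriptor an extension returns.
/// `#[repr(C)]` keeps the field order and layout stable across compilers.
#[repr(C)]
pub struct FfiModule {
    pub abi_version: u32,
    pub init: unsafe extern "C" fn(session: *mut SessionCtx) -> i32,
}

/// Host side: refuse modules built against a different ABI, then run init.
pub fn load_module(module: &FfiModule, session: &mut SessionCtx) -> Result<(), String> {
    if module.abi_version != HOST_ABI_VERSION {
        return Err(format!(
            "ABI mismatch: host {}, module {}",
            HOST_ABI_VERSION, module.abi_version
        ));
    }
    // The extension's init gets a raw pointer because references are not FFI-stable.
    let rc = unsafe { (module.init)(session as *mut SessionCtx) };
    if rc != 0 {
        return Err(format!("init failed with code {}", rc));
    }
    Ok(())
}

/// Extension side: an init that "defines" one function in the session.
pub unsafe extern "C" fn example_init(session: *mut SessionCtx) -> i32 {
    match unsafe { session.as_mut() } {
        Some(s) => {
            s.functions_defined += 1;
            0
        }
        None => 1, // null session pointer
    }
}
```

Checking the version before calling any other function pointer matters because a mismatched module may lay out its structs differently.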
    Ok(Field::new("greet", DataType::Utf8, true))
}

fn call(&self, args: &[ArrayRef]) -> DaftResult<ArrayRef> {
return_field accepts both Utf8 and LargeUtf8, but as_string::<i64>() only handles LargeUtf8. This will panic at runtime if regular Utf8 (with i32 offsets) is passed.
fn call(&self, args: &[ArrayRef]) -> DaftResult<ArrayRef> {
    let names = args[0].as_string_opt::<i32>().unwrap_or_else(|| args[0].as_string::<i64>());
This was intentional because Daft only uses i64 offsets (for now).
# Load the shared library globally so that symbols are visible to other libraries.
ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
Loading with RTLD_GLOBAL makes symbols visible globally, which can cause conflicts if multiple extensions export symbols with the same name. Consider if RTLD_LOCAL would work, or document this requirement if global visibility is necessary for cross-extension dependencies.
This is noted in the extension guide which suggests prefixing externalized C symbols with your extension name.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #6301 +/- ##
==========================================
- Coverage 74.40% 74.35% -0.05%
==========================================
Files 1007 1019 +12
Lines 134317 135312 +995
==========================================
+ Hits 99936 100612 +676
- Misses 34381 34700 +319
[dependencies]
daft-ext = <version>
arrow-array = "57.1.0"
arrow-schema = "57.1.0"
are library authors locked to the same arrow version as us?
Ideally not, but I will have to confirm what our options are. How might we be able to decouple the arrow versions?
I'll check out this PR and see if I can come up with something!
So I think the easiest option is to just specify a range/wildcard for the arrow dependencies inside daft-ext-core in lieu of using the workspace dependency versions
Daft uses `LargeUtf8` (i64 offsets) for strings internally. When downcasting string arrays,
use `as_string::<i64>()`; using `i32` will panic at runtime. Similarly, when checking types
in `return_field`, accept `DataType::LargeUtf8`.
I believe with our new arrow-rs stuff, we'll auto-coerce Arrow's Utf8 to LargeUtf8, so in theory they should be able to return anything that is coercible to a Daft type.
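For readers unfamiliar with the offset-width distinction discussed here: in the Arrow format, `Utf8` and `LargeUtf8` share the same values buffer and differ only in whether offsets are 32-bit or 64-bit. The sketch below models that with plain slices instead of arrow-rs types, so `widen_offsets` and `value_at` are hypothetical helpers for illustration and not part of Daft or arrow-rs.

```rust
use std::convert::TryFrom;

/// Widen 32-bit (Utf8-style) offsets to 64-bit (LargeUtf8-style) offsets.
/// The values buffer is untouched; only the offset width changes.
pub fn widen_offsets(offsets: &[i32]) -> Vec<i64> {
    offsets.iter().map(|&o| i64::from(o)).collect()
}

/// Read the i-th string from a (64-bit offsets, values) pair.
/// Returns None if the index is out of range or the bytes are not valid UTF-8.
pub fn value_at<'a>(offsets: &[i64], values: &'a [u8], i: usize) -> Option<&'a str> {
    let start = usize::try_from(*offsets.get(i)?).ok()?;
    let end = usize::try_from(*offsets.get(i + 1)?).ok()?;
    std::str::from_utf8(values.get(start..end)?).ok()
}
```

Because the two layouts differ only in offset width, code that downcasts to one width cannot read the other without a conversion like this, which is why the choice of `i32` versus `i64` matters at the extension boundary.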
arrow = {workspace = true, features = ["ffi"]}
arrow-array = {workspace = true}
arrow-schema = {workspace = true}
i'm pretty sure this means that any users of this lib are locked to the same arrow version as us.
…rfaces (#6337)

Remove arrow-rs dependency from daft-ext-abi by introducing owned Arrow C Data Interface types (ArrowSchema, ArrowArray, ArrowArrayStream) that are layout compatible with the C ABI. This allows extensions to use any Arrow implementation (arrow-rs, arrow2, etc.) without coupling to a specific version.

## Changes Made
- Add ArrowSchema, ArrowArray, ArrowArrayStream C types to daft-ext-abi
- Add ArrowData (schema + array pair) for safe FFI data transfer
- Move FFI conversion helpers into daft-ext-abi/src/ffi/arrow.rs
- Update DaftScalarFunction trait to use ABI types instead of arrow-rs
- Update DaftSession trait to use ABI types
- Remove arrow-rs dependency from daft-ext-abi Cargo.toml
- Add arrow.rs helpers in daft-ext-core for arrow-rs <-> ABI conversions
- Update hello example to use new ABI types

## Related Issues
- Follow-up to #6301
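As a rough idea of what "owned, layout-compatible" types can look like, here is a sketch of `#[repr(C)]` mirrors of the `ArrowSchema` and `ArrowArray` structs. The field order follows the Arrow C Data Interface specification; the actual definitions in daft-ext-abi may differ in naming and helpers, and `is_released` is a hypothetical convenience function.

```rust
use std::os::raw::{c_char, c_void};

/// Mirror of `struct ArrowSchema` from the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowSchema {
    pub format: *const c_char,
    pub name: *const c_char,
    pub metadata: *const c_char,
    pub flags: i64,
    pub n_children: i64,
    pub children: *mut *mut ArrowSchema,
    pub dictionary: *mut ArrowSchema,
    /// A null release callback marks the struct as released.
    pub release: Option<unsafe extern "C" fn(*mut ArrowSchema)>,
    pub private_data: *mut c_void,
}

/// Mirror of `struct ArrowArray` from the Arrow C Data Interface.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Hypothetical helper: per the spec, a released schema has a null callback.
pub fn is_released(schema: &ArrowSchema) -> bool {
    schema.release.is_none()
}
```

Since these structs depend only on `std`, any Arrow implementation that speaks the C Data Interface can produce or consume them without sharing a crate version with the host.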
About
This PR creates a native extension framework for Daft and is modeled after Postgres. Extension authors ship a normal pip-installable python package with a bundled dylib (instructions in tutorial). Users import python functions to get full IDE support and the native implementations are linked by loading the extension into a session.
Design
Crates
- `daft-ext` - Single dependency for extension authors (published).
- `daft-ext-abi` - Stable `#[repr(C)]` types defining the ABI contract.
- `daft-ext-core` - Public Rust SDK types (`DaftScalarFunction`, traits, error types).
- `daft-ext-macros` - Proc macro (`#[daft_extension]`).
- `daft-ext-internal` - Host-side adapters (`ScalarFunctionHandle`, module loader). Not published.
Functions
`ScalarUDF` is Daft's internal trait for scalar functions. It cannot cross a `dlopen` boundary because it uses Rust trait objects, whose ABI layout is unstable, so `FFI_ScalarFunction` in `daft-ext-abi` is the stable C ABI version of it. `ScalarFunctionHandle` in `daft-ext-internal` wraps an `FFI_ScalarFunction` and implements both `ScalarUDF` and `ScalarFunctionFactory`.

Data crosses the extension boundary using the Arrow C Data Interface (`FFI_ArrowArray` + `FFI_ArrowSchema`) and is zero-copy. The extension FFI uses `arrow::ffi` directly, not the `common-arrow-ffi` crate, which is coupled to PyO3.
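The general pattern of hiding a trait object behind a `#[repr(C)]` struct of function pointers can be sketched as below. This is a simplified illustration under stated assumptions: `ScalarFn`, `FfiScalarFn`, and `AddOne` are hypothetical names, and the function takes a plain `i64` rather than Arrow data, so it is not the actual `FFI_ScalarFunction` definition.

```rust
use std::os::raw::c_void;

/// Hypothetical Rust-side trait standing in for a scalar function.
pub trait ScalarFn {
    fn call(&self, x: i64) -> i64;
}

/// Hypothetical stable-ABI handle: an opaque pointer plus C function pointers.
#[repr(C)]
pub struct FfiScalarFn {
    ctx: *mut c_void,
    call: unsafe extern "C" fn(ctx: *const c_void, x: i64) -> i64,
    release: unsafe extern "C" fn(ctx: *mut c_void),
}

unsafe extern "C" fn call_thunk(ctx: *const c_void, x: i64) -> i64 {
    // ctx points at a Box<dyn ScalarFn> created in FfiScalarFn::new.
    let f = unsafe { &*(ctx as *const Box<dyn ScalarFn>) };
    f.call(x)
}

unsafe extern "C" fn release_thunk(ctx: *mut c_void) {
    // Reclaim ownership so the trait object is dropped exactly once.
    drop(unsafe { Box::from_raw(ctx as *mut Box<dyn ScalarFn>) });
}

impl FfiScalarFn {
    pub fn new(f: Box<dyn ScalarFn>) -> Self {
        // Double-box: a Box<dyn Trait> is a fat pointer, so box it again
        // to get a thin pointer that fits in a single *mut c_void.
        let ctx = Box::into_raw(Box::new(f)) as *mut c_void;
        Self { ctx, call: call_thunk, release: release_thunk }
    }

    /// Host-side call that goes only through the C function pointer.
    pub fn invoke(&self, x: i64) -> i64 {
        unsafe { (self.call)(self.ctx, x) }
    }
}

impl Drop for FfiScalarFn {
    fn drop(&mut self) {
        unsafe { (self.release)(self.ctx) }
    }
}

/// Example implementation used to exercise the handle.
pub struct AddOne;

impl ScalarFn for AddOne {
    fn call(&self, x: i64) -> i64 {
        x + 1
    }
}
```

The host never sees the trait object's vtable; it only calls through `extern "C"` pointers, which is what lets the two sides be compiled separately.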
Installation
We load shared libs once into the process when someone calls `.load_extension` on the session. The top-level `daft.load_extension` will load the given extension module into the active session (from context). All defined functions are scoped to the session in which the extension is loaded.
Changes
- `dvector` example extension, which is a pgvector clone.
- `hello` example extension with a tutorial document.
- `get_function` method for session-backed function resolution, mirroring the SQL implementation.

Can add more clarity upon reviews.
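Session scoping and name-based lookup, as described under Installation and in the `get_function` item above, might look like the following toy model. `Session`, `NativeFn`, and this `greet` are hypothetical simplifications written for illustration; only the method names `attach_function` and `get_function` echo names that appear in this PR, and their real signatures are not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical function signature; the real functions operate on Arrow arrays.
pub type NativeFn = fn(&str) -> String;

/// Toy session that owns its own function registry.
#[derive(Default)]
pub struct Session {
    functions: HashMap<String, NativeFn>,
}

impl Session {
    /// Register a function under a name, visible only in this session.
    pub fn attach_function(&mut self, name: &str, f: NativeFn) {
        self.functions.insert(name.to_string(), f);
    }

    /// Resolve a function by name, or None if this session never loaded it.
    pub fn get_function(&self, name: &str) -> Option<NativeFn> {
        self.functions.get(name).copied()
    }
}

/// Stand-in for a native scalar function.
pub fn greet(name: &str) -> String {
    format!("Hello, {}!", name)
}
```

Keeping the registry on the session rather than in a process-wide table means loading an extension in one session does not change name resolution in another.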
Examples
See the actual greet rust implementation.
Guide
See Daft Extension Guide.