
feat: adds support for daft native extensions via stable C ABI #6301

Merged
rchowell merged 12 commits into main from rchowell/ext on Mar 2, 2026

@rchowell rchowell commented Feb 26, 2026

About

This PR creates a native extension framework for Daft, modeled after Postgres. Extension authors ship a normal pip-installable Python package with a bundled dylib (instructions in the tutorial). Users import Python functions to get full IDE support, and the native implementations are linked by loading the extension into a session.

Design

Crates

  • daft-ext - Single dependency for extension authors (published).
  • daft-ext-abi - Stable #[repr(C)] types defining the ABI contract.
  • daft-ext-core - Public Rust SDK types (DaftScalarFunction, traits, error types).
  • daft-ext-macros - Proc macro (#[daft_extension]).
  • daft-ext-internal - Host-side adapters (ScalarFunctionHandle, module loader). Not published.

Functions

ScalarUDF is Daft's internal trait for scalar functions. It cannot cross a dlopen boundary because it uses Rust trait objects, whose ABI layout is unstable, so FFI_ScalarFunction in daft-ext-abi is its stable C ABI counterpart.

ScalarFunctionHandle in daft-ext-internal wraps a FFI_ScalarFunction and implements both ScalarUDF and ScalarFunctionFactory.

Data crosses the extension boundary zero-copy via the Arrow C Data Interface (FFI_ArrowArray + FFI_ArrowSchema). The extension FFI uses arrow::ffi directly rather than the common-arrow-ffi crate, which is coupled to PyO3.
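
To make the idea of a stable layout concrete, here is a rough sketch that mirrors the two Arrow C Data Interface structs with Python's `ctypes`. The field order follows the public Arrow specification; the `describe` helper is hypothetical and exists only for illustration. None of this is Daft's actual code.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Mirrors `struct ArrowSchema` from the Arrow C Data Interface."""


class ArrowArray(ctypes.Structure):
    """Mirrors `struct ArrowArray` from the Arrow C Data Interface."""


# The structs are self-referential, so the fields are assigned after the
# classes exist (the usual ctypes pattern for incomplete types).
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))),
    ("private_data", ctypes.c_void_p),
]

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))),
    ("private_data", ctypes.c_void_p),
]


def describe(struct_type):
    """Hypothetical helper: list (field name, byte offset) pairs for a struct."""
    return [(name, getattr(struct_type, name).offset) for name, _ in struct_type._fields_]


for field_name, byte_offset in describe(ArrowArray):
    print(f"{field_name:>12} @ {byte_offset}")
```

Because both sides agree on these offsets, only pointers to the structs need to cross the boundary, which is what makes the zero-copy handoff possible.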

Installation

We load shared libs once into the process when someone calls .load_extension on the session. The top-level daft.load_extension will load the given extension module into the active session (from context). All defined functions are scoped to the session in which the extension is loaded.
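
The load-once, session-scoped behavior described above can be modeled with a small toy. This is a sketch only: `ToySession`, `open_module_once`, `fake_opener`, and the greeting text are hypothetical stand-ins, not Daft's implementation or API.

```python
# Hypothetical process-wide cache: library path -> module init callable.
_MODULE_CACHE = {}


def open_module_once(path, opener):
    """Call `opener(path)` (a stand-in for dlopen) only the first time a path is seen."""
    if path not in _MODULE_CACHE:
        _MODULE_CACHE[path] = opener(path)
    return _MODULE_CACHE[path]


class ToySession:
    """Toy session: functions defined by an extension are visible only here."""

    def __init__(self):
        self._functions = {}

    def define_function(self, name, fn):
        self._functions[name] = fn

    def load_extension(self, path, opener):
        init = open_module_once(path, opener)
        init(self)  # the module registers its functions on this session only

    def get_function(self, name):
        if name not in self._functions:
            raise LookupError(f"function {name!r} is not defined in this session")
        return self._functions[name]


opened = []  # records each simulated dlopen


def fake_opener(path):
    """Stand-in for loading a shared library; returns the module's init callable."""
    opened.append(path)

    def init(session):
        session.define_function("greet", lambda name: f"Hello, {name}!")

    return init


first, second, third = ToySession(), ToySession(), ToySession()
first.load_extension("libhello.so", fake_opener)
second.load_extension("libhello.so", fake_opener)

print(opened)                               # the library was opened once
print(first.get_function("greet")("John"))  # available in a loading session
```

In this model `third` never loaded the extension, so `third.get_function("greet")` raises `LookupError` even though the library is already cached in the process.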

Changes

  • Creates the daft-ext-* crates defined above.
  • Defines the stable C ABI, which minimally wraps Arrow FFI.
  • Creates daft-ext-internal to bridge host->ABI.
  • Creates daft-ext-core to bridge extension->ABI.
  • Implements internal Daft traits using the daft-ext-internal types.
  • Adds a dvector example extension which is a pgvector clone.
  • Adds a hello example extension with a tutorial document.
  • Extends the session to support native function registration.
  • Adds the get_function method for session-backed function resolution mirroring the SQL implementation.

I can add more detail during review.

Examples

import daft

# Step 1. Import your extension module
import hello

# Step 2. Load the extension into the current daft session
daft.load_extension(hello)

# Step 3. Use in your dataframe!
df = daft.from_pydict({"name": ["John", "Paul"]})
df = df.select(hello.greet(df["name"]))

See the actual greet Rust implementation.

Guide

See Daft Extension Guide.

@rchowell rchowell requested a review from a team as a code owner February 26, 2026 00:53
@github-actions github-actions bot added the feat label Feb 26, 2026

greptile-apps bot commented Feb 26, 2026

Greptile Summary

This PR adds support for native Daft extensions via a stable C ABI, enabling Rust-based scalar functions to be loaded dynamically and integrated into Daft's expression system.

Key changes:

  • Introduces daft-ext-abi crate defining the stable FFI contract with proper versioning
  • Adds daft-ext-core for extension authors to implement DaftScalarFunction trait
  • Implements daft-ext-internal for loading/caching extension modules via dlopen
  • Provides #[daft_extension] procedural macro to generate entry points
  • Integrates with session to support both Python UDFs and native functions
  • Includes comprehensive documentation and two working examples (hello and dvector)

Issues found:

  • The hello example has a type mismatch bug that will cause runtime panics when passed regular Utf8 strings
  • Using RTLD_GLOBAL in Python could potentially cause symbol conflicts between extensions

Confidence Score: 4/5

  • Safe to merge after fixing the type mismatch bug in the hello example
  • The architecture is sound with proper safety documentation and comprehensive testing. The FFI boundary is carefully designed with panic handling and error propagation. One critical bug in the example needs fixing, and there's a design consideration around RTLD_GLOBAL that should be reviewed.
  • Pay close attention to examples/hello/src/lib.rs which has a type handling bug that needs to be fixed before merge

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/hello/src/lib.rs | Simple hello extension example; has type mismatch between return_field validation and call implementation (accepts Utf8 but only handles LargeUtf8) |
| src/daft-ext-abi/src/lib.rs | Defines stable C ABI contract with FFI structs; well-documented with proper versioning and safety annotations |
| src/daft-ext-internal/src/module.rs | Extension module loading via dlopen with ABI version validation and global caching |
| src/daft-ext-internal/src/function.rs | Wraps FFI functions into Daft's ScalarUDF trait; handles serialization and error propagation correctly |
| daft/session.py | Adds load_extension and get_function methods; uses ctypes.RTLD_GLOBAL which may cause symbol conflicts between extensions |
| src/daft-session/src/session.rs | Integrates extension loading into session; refactored function storage to support both Python UDFs and native functions |

Sequence Diagram

sequenceDiagram
    participant User as Python User
    participant Session as daft.Session
    participant Loader as daft_ext_internal
    participant Module as Extension .so
    participant Function as Native Function

    User->>Session: load_extension(hello)
    Session->>Session: ctypes.CDLL (RTLD_GLOBAL)
    Session->>Loader: load_and_init_extension(path)
    Loader->>Loader: dlopen & cache module
    Loader->>Module: call daft_module_magic()
    Module-->>Loader: FFI_Module struct
    Loader->>Loader: validate ABI version
    Loader->>Module: call init(session_ctx)
    Module->>Module: install extension
    Module->>Loader: define_function(FFI_ScalarFunction)
    Loader->>Session: attach_function(factory)
    Session-->>User: extension loaded

    User->>Session: get_function("greet", args)
    Session->>Function: get_return_field(args, schema)
    Function-->>Session: Field (via FFI)
    Session->>Function: call(args) [Arrow C Data Interface]
    Function-->>Session: ArrayRef (via FFI)
    Session-->>User: Expression result

Last reviewed commit: 564548a

@greptile-apps greptile-apps bot left a comment

56 files reviewed, 2 comments

Ok(Field::new("greet", DataType::Utf8, true))
}

fn call(&self, args: &[ArrayRef]) -> DaftResult<ArrayRef> {
Contributor

return_field accepts both Utf8 and LargeUtf8, but as_string::<i64>() only handles LargeUtf8. This will panic at runtime if regular Utf8 (with i32 offsets) is passed.

Suggested change
fn call(&self, args: &[ArrayRef]) -> DaftResult<ArrayRef> {
let names = args[0].as_string_opt::<i32>().unwrap_or_else(|| args[0].as_string::<i64>());

Contributor Author

This was intentional because Daft only uses i64 offsets (for now).
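
For readers unfamiliar with why the offset width matters, the following sketch encodes a string column the way Arrow's variable-size layout does (an offsets buffer with n + 1 entries plus a data buffer) using only Python's `struct` module. The helper names are hypothetical, the validity bitmap is omitted, and this illustrates the memory layout only, not the arrow-rs downcast itself.

```python
import struct


def encode_strings(values, offset_format):
    """Encode strings as (offsets buffer, data buffer).

    offset_format is "i" for 32-bit offsets (Utf8) or "q" for 64-bit
    offsets (LargeUtf8). Buffers are little-endian.
    """
    encoded = [v.encode("utf-8") for v in values]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    offsets_buf = struct.pack(f"<{len(offsets)}{offset_format}", *offsets)
    return offsets_buf, b"".join(encoded)


def decode_strings(offsets_buf, data, offset_format):
    """Decode the buffers, trusting the caller's claim about the offset width."""
    width = struct.calcsize("<" + offset_format)
    count = len(offsets_buf) // width
    offsets = struct.unpack(f"<{count}{offset_format}", offsets_buf)
    return [data[start:end].decode("utf-8") for start, end in zip(offsets, offsets[1:])]


names = ["John", "Paul"]
large_offsets, data = encode_strings(names, "q")  # LargeUtf8-style, 8 bytes per offset
small_offsets, _ = encode_strings(names, "i")     # Utf8-style, 4 bytes per offset

print(len(large_offsets), len(small_offsets))
print(decode_strings(large_offsets, data, "q"))   # correct width round-trips
print(decode_strings(large_offsets, data, "i"))   # wrong width yields garbage
```

The same bytes read with the wrong width produce nonsense, which is why a typed downcast refuses to treat one layout as the other instead of silently misreading it.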

Comment on lines +269 to +270
# Load the shared library globally so that symbols are visible to other libraries.
ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
Contributor

Loading with RTLD_GLOBAL makes symbols visible globally, which can cause conflicts if multiple extensions export symbols with the same name. Consider if RTLD_LOCAL would work, or document this requirement if global visibility is necessary for cross-extension dependencies.

Contributor Author

This is noted in the extension guide which suggests prefixing externalized C symbols with your extension name.
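
As a small illustration of the prefixing advice, the sketch below checks a set of exported C symbol names for clashes across extensions. The function names, the extension names `ext_a` and `ext_b`, and the symbol names are all hypothetical; this is not a tool shipped with Daft.

```python
from collections import defaultdict


def find_symbol_conflicts(exports):
    """Return {symbol: sorted extension names} for symbols exported more than once.

    `exports` maps an extension name to the C symbol names it exports.
    """
    owners = defaultdict(set)
    for extension, symbols in exports.items():
        for symbol in symbols:
            owners[symbol].add(extension)
    return {s: sorted(exts) for s, exts in owners.items() if len(exts) > 1}


def prefixed(extension, symbol):
    """Apply the suggested convention: prefix a symbol with its extension name."""
    return f"{extension}_{symbol}"


unprefixed = {"ext_a": ["normalize", "greet"], "ext_b": ["normalize"]}
with_prefix = {
    ext: [prefixed(ext, symbol) for symbol in symbols]
    for ext, symbols in unprefixed.items()
}

print(find_symbol_conflicts(unprefixed))   # "normalize" is exported twice
print(find_symbol_conflicts(with_prefix))  # no clashes once prefixed
```

With global symbol visibility, two libraries exporting the same name can shadow one another, so unique prefixes remove the ambiguity at the source.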

@codecov

codecov bot commented Feb 26, 2026

Codecov Report

❌ Patch coverage is 73.87914% with 268 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.35%. Comparing base (0598d31) to head (b66a101).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/daft-ext-internal/src/module.rs | 49.49% | 50 Missing ⚠️ |
| src/daft-session/src/session.rs | 38.96% | 47 Missing ⚠️ |
| src/daft-session/src/python.rs | 7.89% | 35 Missing ⚠️ |
| src/daft-ext-internal/src/function.rs | 88.00% | 33 Missing ⚠️ |
| src/daft-ext-core/src/function.rs | 87.66% | 28 Missing ⚠️ |
| src/daft-ext-macros/src/lib.rs | 0.00% | 23 Missing ⚠️ |
| src/daft-ext-core/src/session.rs | 75.32% | 19 Missing ⚠️ |
| daft/session.py | 64.86% | 13 Missing ⚠️ |
| src/daft-sql/src/functions.rs | 8.33% | 11 Missing ⚠️ |
| src/daft-session/src/function.rs | 52.63% | 9 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #6301      +/-   ##
==========================================
- Coverage   74.40%   74.35%   -0.05%     
==========================================
  Files        1007     1019      +12     
  Lines      134317   135312     +995     
==========================================
+ Hits        99936   100612     +676     
- Misses      34381    34700     +319     
| Files with missing lines | Coverage Δ |
| --- | --- |
| daft/__init__.py | 86.66% <ø> (ø) |
| src/daft-ext-abi/src/ffi/arrow.rs | 100.00% <100.00%> (ø) |
| src/daft-ext-abi/src/ffi/strings.rs | 100.00% <100.00%> (ø) |
| src/daft-ext-abi/src/lib.rs | 100.00% <100.00%> (ø) |
| src/daft-ext-core/src/error.rs | 100.00% <100.00%> (ø) |
| src/daft-ext-core/src/ffi/arrow.rs | 100.00% <100.00%> (ø) |
| src/daft-ext-core/src/ffi/trampoline.rs | 100.00% <100.00%> (ø) |
| src/daft-session/src/function.rs | 52.63% <52.63%> (ø) |
| src/daft-sql/src/functions.rs | 74.51% <8.33%> (-2.05%) ⬇️ |
| daft/session.py | 71.00% <64.86%> (-0.85%) ⬇️ |
... and 7 more

... and 12 files with indirect coverage changes


@universalmind303 universalmind303 self-requested a review February 26, 2026 17:30
[dependencies]
daft-ext = <version>
arrow-array = "57.1.0"
arrow-schema = "57.1.0"
Member

are library authors locked to the same arrow version as us?

Contributor Author

Ideally not, but I will have to confirm what our options are. How might we be able to decouple the arrow versions?

Member

I'll check out this PR and see if I can come up with something!

Member

So I think the easiest option is to just specify a range/wildcard for the arrow dependencies inside daft-ext-core in lieu of using the workspace dependency versions

Comment on lines +227 to +229
Daft uses `LargeUtf8` (i64 offsets) for strings internally. When downcasting string arrays,
use `as_string::<i64>()` — using `i32` will panic at runtime. Similarly, when checking types
in `return_field`, accept `DataType::LargeUtf8`.
Member

I believe that with our new arrow-rs work, we'll auto-coerce Arrow's Utf8 to LargeUtf8, so in theory they should be able to return anything that is coercible to a Daft type.

Comment on lines +10 to +12
arrow = {workspace = true, features = ["ffi"]}
arrow-array = {workspace = true}
arrow-schema = {workspace = true}
Member

I'm pretty sure this means that any users of this lib are locked to the same arrow version as us.

@rchowell rchowell changed the title feat: adds support for daft native extensions via a stable C ABI feat: adds support for daft native extensions via stable C ABI Mar 2, 2026
@rchowell rchowell merged commit cd92fe3 into main Mar 2, 2026
35 of 36 checks passed
@rchowell rchowell deleted the rchowell/ext branch March 2, 2026 20:57
rchowell added a commit that referenced this pull request Mar 6, 2026
…rfaces (#6337)

Remove arrow-rs dependency from daft-ext-abi by introducing owned Arrow
C Data Interface types (ArrowSchema, ArrowArray, ArrowArrayStream) that
are layout compatible with the C ABI. This allows extensions to use any
Arrow implementation (arrow-rs, arrow2, etc.) without coupling to a
specific version.

## Changes Made

- Add ArrowSchema, ArrowArray, ArrowArrayStream C types to daft-ext-abi
- Add ArrowData (schema + array pair) for safe FFI data transfer
- Move FFI conversion helpers into daft-ext-abi/src/ffi/arrow.rs
- Update DaftScalarFunction trait to use ABI types instead of arrow-rs
- Update DaftSession trait to use ABI types
- Remove arrow-rs dependency from daft-ext-abi Cargo.toml
- Add arrow.rs helpers in daft-ext-core for arrow-rs <-> ABI conversions
- Update hello example to use new ABI types

## Related Issues

- Follow-up to #6301