Skip to content

feat: Add protocol aliases for IOConfig#6252

Open
universalmind303 wants to merge 1 commit intomainfrom
universalmind303/protocol-aliases
Open

feat: Add protocol aliases for IOConfig#6252
universalmind303 wants to merge 1 commit intomainfrom
universalmind303/protocol-aliases

Conversation

@universalmind303
Copy link
Member

Changes Made

Adds protocol aliases to IOConfig: user-defined mappings from custom scheme names to existing schemes. For example, "my-s3" -> "s3" lets organizations use domain-specific protocol names that route to standard backends (including native S3, Azure, GCS — not just OpenDAL).

Python API:

io_config = IOConfig(
    protocol_aliases={"my-s3": "s3", "company-store": "gcs"},
    s3=S3Config(endpoint_url="https://my-proprietary-endpoint.example.com"),
)
daft.read_parquet("my-s3://bucket/path", io_config=io_config)

Implementation

  • src/common/io-config/src/config.rs — Added protocol_aliases: BTreeMap<String, String> field to IOConfig, display support, and validate_protocol_aliases() that rejects alias keys matching built-in schemes.
  • src/daft-io/src/lib.rs — Added resolve_url_alias() using Cow for zero-allocation on the common (no-alias) path. Integrated into get_source_and_path(), single_url_get(), single_url_put(), and single_url_get_size(). Added 7 Rust unit tests.
  • src/common/io-config/src/python.rs — Added protocol_aliases parameter to IOConfig::new() and replace() with case normalization and validation. Added getter.
  • daft/daft/__init__.pyi — Updated type stubs.
  • tests/io/test_protocol_aliases.py — 9 config tests + 2 integration tests using OpenDAL fs backend.

Design Decisions

  • Single-level resolution — no chaining, avoids infinite loops
  • Built-in scheme protection — aliasing s3, gcs, etc. as keys is rejected at construction time
  • Case-insensitive — consistent with parse_url() which already lowercases schemes
  • Minimal change surfaceparse_url() and its 17+ external callers remain untouched; alias resolution happens in IOClient methods before calling parse_url()

Related Issues

Builds on PR #6177 (OpenDAL support).

🤖 Generated with Claude Code

Allow user-defined mappings from custom scheme names to existing schemes
(e.g., "my-s3" -> "s3") so organizations can use domain-specific protocol
names that route to any backend including native S3, Azure, GCS, and OpenDAL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@universalmind303 universalmind303 requested a review from a team as a code owner February 19, 2026 22:14
@github-actions github-actions bot added the feat label Feb 19, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 19, 2026

Greptile Summary

This PR adds protocol alias support to IOConfig, enabling custom scheme names that map to existing backends (e.g., "my-s3" -> "s3").

Key changes:

  • Added protocol_aliases: BTreeMap<String, String> to IOConfig with validation against built-in schemes
  • Implemented resolve_url_alias() with zero-allocation Cow for the no-alias case
  • Integrated alias resolution into IOClient methods (single_url_get, single_url_put, single_url_get_size)
  • Added Python API support with case normalization and validation
  • Comprehensive test coverage including config validation and OpenDAL integration tests

Issues found:

  • The io_glob function in src/daft-io/src/python.rs:38 calls parse_url on the unresolved URL but get_source on the original URL (which resolves internally). This causes the extracted path to have the original scheme instead of the resolved one, breaking glob operations with aliases.

Confidence Score: 3/5

  • Safe to merge after fixing the io_glob alias resolution bug
  • The core implementation is solid with proper validation, zero-allocation optimization, and good test coverage. However, there's a logical bug in io_glob that will cause incorrect behavior when using protocol aliases with glob patterns. This needs to be fixed before merging.
  • Pay close attention to src/daft-io/src/python.rs - the io_glob function requires alias resolution before parse_url

Important Files Changed

Filename Overview
src/daft-io/src/python.rs Missing alias resolution in io_glob function causes incorrect path handling
src/daft-io/src/lib.rs Implements resolve_url_alias with proper Cow usage and integrates into IOClient methods
src/common/io-config/src/config.rs Adds protocol_aliases field with validation against built-in schemes
src/common/io-config/src/python.rs Adds protocol_aliases to Python API with case normalization and validation
tests/io/test_protocol_aliases.py Comprehensive config tests and integration tests using OpenDAL fs backend

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User calls daft.read_parquet with aliased URL] --> B[IOClient receives request]
    B --> C{protocol_aliases empty?}
    C -->|Yes| D[Use URL as-is]
    C -->|No| E[resolve_url_alias checks scheme]
    E --> F{Scheme matches alias?}
    F -->|Yes| G[Rewrite scheme to target]
    F -->|No| H[Return original URL]
    G --> I[parse_url extracts SourceType and path]
    H --> I
    D --> I
    I --> J[get_source_and_path creates ObjectSource]
    J --> K{SourceType?}
    K -->|S3| L[S3LikeSource]
    K -->|GCS| M[GCSSource]
    K -->|OpenDAL| N[OpenDALSource]
    K -->|Other| O[Other sources]
    L --> P[Perform I/O operation]
    M --> P
    N --> P
    O --> P
Loading

Last reviewed commit: 287272c

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 19, 2026

Additional Comments (1)

src/daft-io/src/python.rs
parse_url is called on the unresolved input, but get_source (line 42) resolves the alias internally. This causes the path to contain the original scheme instead of the resolved one.

            let resolved = crate::resolve_url_alias(&input, &io_config.unwrap_or_default().config);
            let (_, path) = parse_url(&resolved)?;
            let runtime_handle = get_io_runtime(multithreaded_io);

            runtime_handle.block_on_current_thread(async {
                let source = io_client.get_source(&resolved).await?;

@codecov
Copy link

codecov bot commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 97.60479% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.44%. Comparing base (ffe0024) to head (287272c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-io/src/lib.rs 95.83% 3 Missing ⚠️
src/common/io-config/src/config.rs 92.30% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #6252   +/-   ##
=======================================
  Coverage   73.44%   73.44%           
=======================================
  Files        1001     1001           
  Lines      133163   133261   +98     
=======================================
+ Hits        97798    97875   +77     
- Misses      35365    35386   +21     
Files with missing lines Coverage Δ
src/common/io-config/src/python.rs 54.84% <100.00%> (+1.18%) ⬆️
src/common/io-config/src/config.rs 96.82% <92.30%> (-1.18%) ⬇️
src/daft-io/src/lib.rs 78.29% <95.83%> (+3.91%) ⬆️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant