Skip to content

refactor(arrow2): daft-csv crate#6298

Merged
universalmind303 merged 11 commits intomainfrom
universalmind303/arrow-rs-csv
Feb 27, 2026
Merged

refactor(arrow2): daft-csv crate#6298
universalmind303 merged 11 commits intomainfrom
universalmind303/arrow-rs-csv

Conversation

@universalmind303
Copy link
Member

Changes Made

Related Issues

Move type coercion (Utf8→LargeUtf8, Binary→LargeBinary) upstream into
arrow2's CSV inference and remove the now-unnecessary cast_array_for_daft_if_needed
compatibility shim, along with the daft-arrow dependency from daft-core.
…w-rs

Replace daft-arrow (arrow2) dependencies with native arrow-rs crates (arrow-schema, arrow-array, arrow-buffer) and the sync csv crate. This removes the arrow2 CSV I/O layer, inlining row reading helpers and rewriting deserialization to use arrow-rs builders and primitive types instead of arrow2's TrustedLen-based APIs.
@universalmind303 universalmind303 requested a review from a team as a code owner February 25, 2026 21:01
@universalmind303 universalmind303 marked this pull request as draft February 25, 2026 21:04
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 25, 2026

Greptile Summary

This PR refactors the daft-csv crate by migrating from arrow2 to arrow-rs. The changes replace daft-arrow (arrow2-based) with arrow-rs crates (arrow-schema, arrow-array) and switch from arrow2's CSV reader to the standard csv/csv-async crates.

Key improvements:

  • Fixes previously reported time unit bugs in deserialize.rs - the code now properly matches on time_unit parameters and uses the correct Arrow primitive types (Time32SecondType, Time32MillisecondType, etc.) instead of hardcoding them
  • Removes deprecated arrow2 migration warnings by eliminating #![allow(deprecated)] attributes
  • Reorganizes utilities by moving coerce_to_daft_compatible_schema from daft-core/utils/arrow.rs to daft-parquet/src/utils.rs
  • Rewrites test helper check_equal_local_arrow_rs with a more comprehensive reference implementation using the sync csv crate

Notable changes:

  • New local_read_rows and async_read_rows functions replace arrow2's reader implementations
  • Error handling updated from ArrowError to External errors in schema inference
  • Field conversions now use arrow-rs types throughout the pipeline

Confidence Score: 5/5

  • This PR is safe to merge - it's a well-executed refactor that fixes existing bugs and modernizes the codebase
  • High confidence because this is a systematic migration with clear improvements: fixes previously identified time unit bugs, removes deprecated code, maintains test coverage, and follows a consistent pattern throughout. The changes are well-structured with proper error handling and type conversions.
  • No files require special attention - all changes follow consistent migration patterns

Important Files Changed

Filename Overview
src/daft-decoding/src/deserialize.rs Migrated from arrow2 to arrow-rs, fixing previously reported time unit bugs by properly matching time_unit parameter with correct Arrow types
src/daft-decoding/Cargo.toml Updated dependencies from daft-arrow (arrow2) to arrow-array and csv crates
src/daft-csv/Cargo.toml Updated dependencies, added arrow-schema, csv, and daft-schema, removed daft-arrow
src/daft-csv/src/local.rs Migrated from arrow2 CSV reader to csv crate, added local_read_rows function, updated field conversions
src/daft-csv/src/read.rs Major refactor of CSV reading logic and test helper check_equal_local_arrow_rs with new reference implementation
src/daft-csv/src/metadata.rs Migrated schema inference to arrow-schema, updated error handling from ArrowError to External
src/daft-parquet/src/utils.rs Moved coerce_to_daft_compatible_schema utility from daft-core to daft-parquet

Last reviewed commit: 869da3c

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

22 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

@universalmind303
Copy link
Member Author

@greptileai

…w-rs-csv

# Conflicts:
#	src/daft-json/src/decoding.rs
#	src/daft-json/src/local.rs
#	src/daft-json/src/read.rs
@codecov
Copy link

codecov bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 76.44110% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.32%. Comparing base (4b450a6) to head (cba377a).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-decoding/src/deserialize.rs 51.78% 81 Missing ⚠️
src/daft-csv/src/read.rs 93.37% 10 Missing ⚠️
src/daft-csv/src/local.rs 94.11% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6298      +/-   ##
==========================================
- Coverage   74.37%   74.32%   -0.06%     
==========================================
  Files        1006     1006              
  Lines      134209   134292      +83     
==========================================
- Hits        99823    99810      -13     
- Misses      34386    34482      +96     
Files with missing lines Coverage Δ
src/daft-csv/src/lib.rs 40.00% <ø> (ø)
src/daft-csv/src/local/pool.rs 81.63% <100.00%> (ø)
src/daft-csv/src/metadata.rs 90.37% <100.00%> (+0.17%) ⬆️
src/daft-csv/src/schema.rs 100.00% <100.00%> (ø)
src/daft-parquet/src/lib.rs 77.77% <100.00%> (ø)
src/daft-parquet/src/utils.rs 81.52% <ø> (ø)
src/daft-scan/src/hive.rs 98.23% <100.00%> (-0.03%) ⬇️
src/daft-csv/src/local.rs 86.39% <94.11%> (+0.25%) ⬆️
src/daft-csv/src/read.rs 94.12% <93.37%> (-0.53%) ⬇️
src/daft-decoding/src/deserialize.rs 57.30% <51.78%> (-7.15%) ⬇️

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@universalmind303 universalmind303 marked this pull request as ready for review February 27, 2026 18:37
@universalmind303
Copy link
Member Author

@greptileai

@universalmind303 universalmind303 enabled auto-merge (squash) February 27, 2026 18:50
Copy link
Collaborator

@desmondcheongzx desmondcheongzx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@universalmind303 universalmind303 merged commit e9bfeac into main Feb 27, 2026
36 checks passed
@universalmind303 universalmind303 deleted the universalmind303/arrow-rs-csv branch February 27, 2026 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants