
feat: bucketed scan for native_datafusion Parquet scan #1719

Open
mbutrovich opened this issue May 7, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@mbutrovich

What is the problem the feature request solves?

The native_datafusion Parquet scan does not support bucketed scans. Without a fallback, it fails most of the tests in Spark's BucketedReadSuite. With a bucketed scan, some partitions end up with no file to read, so their PartitionedFile is empty.

Describe the potential solution

I don't think DataSourceExec can be constructed with no files. For a bucketed scan, when a partition has no corresponding file, we might need to replace that node with a different no-op node that generates an empty data set with the correct schema.
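To make the proposal concrete, here is a minimal sketch of the per-partition decision. The `PartitionPlan` enum and `plan_for_partition` function are hypothetical stand-ins written for illustration; they are not Comet or DataFusion types. In DataFusion, `EmptyExec` looks like a candidate for the no-op node, but that is an assumption rather than something confirmed here.

```rust
/// Hypothetical stand-in for a physical plan node (not a DataFusion type).
#[derive(Debug, PartialEq)]
enum PartitionPlan {
    /// Normal case: scan the listed Parquet files.
    Scan { files: Vec<String>, columns: Vec<String> },
    /// No-op case: produce zero rows but still carry the expected schema.
    Empty { columns: Vec<String> },
}

/// Choose the plan node for one Spark partition.
/// An empty file list yields the no-op node instead of a file scan.
fn plan_for_partition(files: &[String], columns: &[String]) -> PartitionPlan {
    if files.is_empty() {
        PartitionPlan::Empty { columns: columns.to_vec() }
    } else {
        PartitionPlan::Scan {
            files: files.to_vec(),
            columns: columns.to_vec(),
        }
    }
}
```

The point of the sketch is that both variants carry the same column list, so downstream operators see a consistent schema whether or not the partition has a file.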

Additional context

No response

@mbutrovich

Interestingly, native_datafusion currently passes the "bucketed table" Comet test. I suspect the Spark SQL tests are doing some bucket pruning, which is how some partitions end up with empty file paths.
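If that suspicion is right, the effect can be sketched as follows. This is an illustration under assumptions, not Spark's code: `bucketed_partitions` is a hypothetical helper, and it assumes a bucketed read still produces one partition per bucket, with pruned buckets keeping their slot but losing their files.

```rust
/// Build one file list per bucket (hypothetical helper, for illustration).
/// Buckets whose id is not in `kept` are pruned: they still get a
/// partition, but its file list is empty.
fn bucketed_partitions(files_by_bucket: &[Vec<String>], kept: &[usize]) -> Vec<Vec<String>> {
    files_by_bucket
        .iter()
        .enumerate()
        .map(|(bucket_id, files)| {
            if kept.contains(&bucket_id) {
                files.clone()
            } else {
                Vec::new() // pruned bucket: partition exists, no files to read
            }
        })
        .collect()
}
```

With three buckets and only bucket 1 kept, the result still has three partitions, and partitions 0 and 2 have nothing to read. A test that never prunes would not hit this case, which would be consistent with the "bucketed table" test passing.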
