What is the problem the feature request solves?
When using the native_datafusion or native_iceberg_compat Parquet readers, which are based on DataFusion's DataSourceExec, the schemas that Comet passes in result in dictionaries being unpacked immediately.
Describe the potential solution
arrow-rs treats a provided schema as a hint and, for dictionary-encoded columns, preserves the encoding when the hint asks for a dictionary type:
https://github.com/apache/arrow-rs/blob/880be2f0a0b9675d8b42206e70543472a58792aa/parquet/src/arrow/schema/primitive.rs#L91
The challenge is similar to the one with INT96: the native side does not really have the Parquet schema when it generates the DataSourceExec. We would either need to pass the schema from the Spark side early on, when it is first read, or add a coercion rule to DataFusion.
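To make the hint behavior concrete, here is a minimal, self-contained sketch of the kind of rule the linked arrow-rs code applies. It is a simplified model, not the arrow-rs implementation: `Ty` is a toy stand-in for Arrow's `DataType`, and this `apply_hint` only models the dictionary case. The point it illustrates is that the dictionary survives only if the schema handed to the reader asks for a dictionary type, which is why a plain schema from Comet leads to immediate unpacking.

```rust
// Toy stand-in for Arrow's DataType. This is an assumption made for
// illustration; it is NOT the real arrow-rs enum.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int32,
    Utf8,
    // (key type, value type)
    Dictionary(Box<Ty>, Box<Ty>),
}

// Simplified model of a schema-hint rule: `parquet` is the type derived
// from the Parquet file, `hint` is the type requested by the caller.
fn apply_hint(parquet: Ty, hint: Ty) -> Ty {
    match &hint {
        Ty::Dictionary(_, value) => {
            // Apply the hint's value type to the file type first.
            let inner = apply_hint(parquet, value.as_ref().clone());
            if &inner == value.as_ref() {
                // Value types agree: keep the dictionary type from the hint.
                hint
            } else {
                // Value types disagree: fall back to the plain type.
                inner
            }
        }
        // A non-dictionary hint never introduces a dictionary, so the
        // column comes back as the plain (unpacked) type.
        _ => parquet,
    }
}
```

In this model, `apply_hint(Ty::Utf8, Ty::Utf8)` yields `Ty::Utf8` (unpacked), while passing `Ty::Dictionary(Box::new(Ty::Int32), Box::new(Ty::Utf8))` as the hint for a `Ty::Utf8` column yields the dictionary type. Either proposed fix amounts to getting a dictionary-typed hint to the reader.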
Additional context
No response