[FEA] In Parquet reader, allow selecting columns to read by column index #21074

@devavret

Is your feature request related to a problem? Please describe.
Currently, cudf::io::parquet_reader_options has methods to select the columns to read by name, but there is no way to select them by index for cases where we don't know or care about the column names stored in the file.

Describe the solution you'd like
Add an alternative method to select columns by index.

Additional context
When registering Parquet files as datasets, Presto can store a table schema whose column names differ from those in the Parquet file. As a result, when selecting columns to read, Presto can pass a set of column names that don't exist in the dataset's Parquet file, and no data is read.
For example, a Parquet file with columns "string_col_1, float_col_2" could be registered with names like "str1, float2". When reading, the SQL passed to Presto could look like:

SELECT str1 FROM tbl;

This would ask the cudf Parquet reader to read column "str1" from the file, and it would return an empty table because "str1" doesn't exist in the file schema.

With the feature requested in this issue, we could translate the column selection from Hive schema names to column indices in the presto-velox layer and pass those indices to cudf::io::parquet_reader_options.
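To illustrate the translation step described above, here is a minimal sketch of how a caller might turn a selection of registered table column names into positional indices. The helper selected_column_indices is hypothetical and is not part of cudf, Presto, or Velox; it assumes that the registered table schema lists columns in the same order as the Parquet file. How the resulting indices would be handed to the reader is exactly what this issue asks for, so no reader call is shown.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not an existing cudf/Presto/Velox API): map the column
// names selected from the registered table schema to their positions in it.
// Assumes the table schema lists columns in the same order as the file.
std::vector<std::size_t> selected_column_indices(
    const std::vector<std::string>& table_schema,
    const std::vector<std::string>& selected)
{
  // Build a name -> position lookup; emplace keeps the first occurrence
  // if a name appears more than once.
  std::unordered_map<std::string, std::size_t> position;
  for (std::size_t i = 0; i < table_schema.size(); ++i) {
    position.emplace(table_schema[i], i);
  }

  std::vector<std::size_t> indices;
  indices.reserve(selected.size());
  for (const auto& name : selected) {
    const auto it = position.find(name);
    if (it == position.end()) {
      throw std::invalid_argument("column not in table schema: " + name);
    }
    indices.push_back(it->second);
  }
  return indices;
}
```

With the table schema "str1, float2" from the example, selecting "str1" yields index 0, which refers to "string_col_1" in the file regardless of the name stored there.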

Describe alternatives you've considered
Reading the Parquet schema ad hoc in presto-velox and using it to translate Hive schema column names to the column names in the file.
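For comparison, this alternative could look roughly like the sketch below. The function translate_to_file_names is a hypothetical helper, and the sketch assumes the file's column names have already been read from the Parquet footer by some other means and that they correspond to the table schema by position. The returned names could then be passed to the existing name-based column selection.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not an existing cudf/Presto/Velox API): translate the
// selected table-schema names into the names stored in the file.
// Assumes table_schema[i] and file_schema[i] describe the same column.
std::vector<std::string> translate_to_file_names(
    const std::vector<std::string>& table_schema,
    const std::vector<std::string>& file_schema,
    const std::vector<std::string>& selected)
{
  if (table_schema.size() != file_schema.size()) {
    throw std::invalid_argument("table and file schemas differ in column count");
  }

  std::vector<std::string> file_names;
  file_names.reserve(selected.size());
  for (const auto& name : selected) {
    // Linear search is fine for a sketch; schemas are usually small.
    const auto it = std::find(table_schema.begin(), table_schema.end(), name);
    if (it == table_schema.end()) {
      throw std::invalid_argument("column not in table schema: " + name);
    }
    const auto index =
        static_cast<std::size_t>(std::distance(table_schema.begin(), it));
    file_names.push_back(file_schema[index]);
  }
  return file_names;
}
```

The drawback relative to index-based selection is that the caller has to read the file schema itself before configuring the reader.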

Labels: cuIO, feature request, libcudf