[FEA] In Parquet reader, allow selecting columns to read by column index

**Is your feature request related to a problem? Please describe.**
Currently, `cudf::io::parquet_reader_options` has methods to select the columns to read by name, but there is no way to select them by index for cases where we don't know or don't care about the column names stored in the file.

**Describe the solution you'd like**
Add an alternative method to select columns by index.

**Additional context**
When registering Parquet files as datasets, Presto can store a table schema whose column names differ from those in the Parquet file. As a result, when selecting columns to read, Presto can pass a set of column names that don't exist in the dataset's Parquet file, so no data is read.
For example, a Parquet file with columns `string_col_1, float_col_2` could be registered with names like `str1, float2`. When reading, the SQL passed to Presto could look like:
```sql
SELECT str1 FROM tbl;
```
This would ask the cudf Parquet reader to read column `str1` from the file, and it would return an empty table because `str1` doesn't exist in the file schema.

With the feature requested in this issue, we could translate the column selection from Hive schema names to column indices in the presto-velox layer and pass those indices to `cudf::io::parquet_reader_options`.
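
A minimal sketch of that translation step, assuming the registered table schema and the file schema agree positionally on top-level columns. The helper name `selected_column_indices` is hypothetical and is not part of cudf, Presto, or Velox; it only illustrates how the caller could derive the indices it would hand to the reader.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not a cudf API): maps the column names selected in the
// query to their positions in the registered table schema. Assumes the table
// schema lists columns in the same order as the Parquet file.
std::optional<std::vector<std::size_t>> selected_column_indices(
  const std::vector<std::string>& table_schema,
  const std::vector<std::string>& selected)
{
  // Build a name -> position lookup for the registered schema.
  std::unordered_map<std::string, std::size_t> position;
  for (std::size_t i = 0; i < table_schema.size(); ++i) {
    position.emplace(table_schema[i], i);
  }

  std::vector<std::size_t> indices;
  indices.reserve(selected.size());
  for (const auto& name : selected) {
    auto it = position.find(name);
    // A name missing from the registered schema is reported to the caller
    // instead of being silently dropped.
    if (it == position.end()) { return std::nullopt; }
    indices.push_back(it->second);
  }
  return indices;
}
```

For the example above, a registered schema of `str1, float2` and a selection of `str1` would yield index `0`, which refers to `string_col_1` in the file regardless of its stored name.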

**Describe alternatives you've considered**
Read the Parquet schema ad hoc in presto-velox and use it to translate the Hive schema column names to the column names in the file.
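
For comparison, this alternative could look roughly like the sketch below. It assumes the file's top-level column names have already been read from the Parquet footer into `file_schema` and that they line up positionally with the registered schema. The function `file_column_names` is hypothetical and is only meant to show the extra lookup this workaround requires.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a cudf API): rewrites the selected registered
// names into the names actually stored in the file, matching by position.
std::vector<std::string> file_column_names(
  const std::vector<std::string>& file_schema,   // names read from the file
  const std::vector<std::string>& table_schema,  // names registered in the catalog
  const std::vector<std::string>& selected)      // names used in the query
{
  std::vector<std::string> result;
  result.reserve(selected.size());
  for (const auto& name : selected) {
    auto it = std::find(table_schema.begin(), table_schema.end(), name);
    if (it == table_schema.end()) {
      throw std::invalid_argument("column not in registered schema: " + name);
    }
    auto const idx = static_cast<std::size_t>(std::distance(table_schema.begin(), it));
    if (idx >= file_schema.size()) {
      throw std::out_of_range("file has fewer columns than registered schema");
    }
    result.push_back(file_schema[idx]);
  }
  return result;
}
```

The resulting names could then be passed to the existing name-based column selection, at the cost of reading the file schema once before the actual read.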


[FEA] In Parquet reader, allow selecting columns to read by column index #21074
