Description
Is your feature request related to a problem? Please describe.
Currently, cudf::io::parquet_reader_options has methods for selecting the columns to read by name, but there is no way to select them by index for cases where we don't know or care about the column names stored in the file.
Describe the solution you'd like
Add an alternative method to select columns by index.
Additional context
When registering Parquet files as datasets, Presto can store a table schema whose column names differ from those in the Parquet file. As a result, when Presto selects columns to read, it can pass column names that don't exist in the dataset's Parquet file, and no data is read.
For example, a Parquet file with columns "string_col_1" and "float_col_2" could be registered with the names "str1" and "float2". When reading, the SQL passed to Presto could look like
SELECT str1 FROM tbl;
This would ask the cudf Parquet reader to read column "str1" from the file, and it would return an empty table because "str1" doesn't exist in the file schema.
With the feature requested in this issue, we could translate the column selection from Hive schema names to column indices in the presto-velox layer and pass those indices to cudf::io::parquet_reader_options.
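To illustrate the translation step, here is a minimal sketch using only the C++ standard library. The function name names_to_indices and its signature are hypothetical, not part of cudf, Presto, or Velox, and the sketch assumes that the registered table schema lists top-level columns in the same order as the file. It does not show the requested index-based reader option, which does not exist yet.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not a cudf/Presto/Velox API): map column names from the
// registered table schema to their positional indices.
// Assumes table_schema is ordered the same way as the columns in the file.
// Returns std::nullopt if any selected name is not in the table schema.
std::optional<std::vector<std::size_t>> names_to_indices(
    const std::vector<std::string>& table_schema,
    const std::vector<std::string>& selected)
{
  std::unordered_map<std::string, std::size_t> position;
  for (std::size_t i = 0; i < table_schema.size(); ++i) {
    // emplace keeps the first occurrence if a name appears more than once
    position.emplace(table_schema[i], i);
  }

  std::vector<std::size_t> indices;
  indices.reserve(selected.size());
  for (const auto& name : selected) {
    auto it = position.find(name);
    if (it == position.end()) { return std::nullopt; }
    indices.push_back(it->second);
  }
  return indices;
}
```

For the example above, selecting "str1" from a table schema of "str1", "float2" would yield index 0, which identifies "string_col_1" in the file regardless of its stored name.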
Describe alternatives you've considered
Reading the Parquet schema ad hoc in presto-velox and using it to translate the Hive schema column names into the column names stored in the file.
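A rough sketch of what this workaround could look like, assuming the file's top-level column names have already been read from the Parquet footer and that both schemas share the same column order. The helper to_file_names is hypothetical and uses only the C++ standard library. How the file schema is obtained is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a cudf/Presto/Velox API): translate column names
// from the registered table schema into the names stored in the file, matching
// columns by position. Assumes both schemas list top-level columns in the
// same order.
std::vector<std::string> to_file_names(const std::vector<std::string>& table_schema,
                                       const std::vector<std::string>& file_schema,
                                       const std::vector<std::string>& selected)
{
  std::vector<std::string> file_names;
  file_names.reserve(selected.size());
  for (const auto& name : selected) {
    auto it = std::find(table_schema.begin(), table_schema.end(), name);
    if (it == table_schema.end()) {
      throw std::invalid_argument("unknown table column: " + name);
    }
    auto pos = static_cast<std::size_t>(it - table_schema.begin());
    if (pos >= file_schema.size()) {
      throw std::out_of_range("file has no column at position " + std::to_string(pos));
    }
    file_names.push_back(file_schema[pos]);
  }
  return file_names;
}
```

The translated names could then be given to the existing name-based column selection on cudf::io::parquet_reader_options. The cost is an extra schema read per file, which the index-based option would avoid.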