Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
By default, the arrow-rs parquet writer saves the entire actual min and max values for any column that has statistics enabled into the page metadata.

For large binary/string columns (think JSON blobs), this means two potentially large values (a min and a max) are stored in both the file-level metadata and in each page header.

This can lead to pathological cases such as those described in the issues linked under "Additional context" below.
It is possible to control the maximum size of these values using:

- `WriterPropertiesBuilder::set_statistics_truncate_length`
- `WriterPropertiesBuilder::set_column_index_truncate_length`

However, these values currently default to `None` (unlimited).
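For reference, here is a minimal sketch of opting into truncation today via these builder methods; the 128-byte limit is illustrative (it matches the value proposed below), not the current default:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Cap the min/max values written to statistics and to the column
    // index at 128 bytes each. 128 is an illustrative choice here;
    // today these settings default to unlimited.
    let props = WriterProperties::builder()
        .set_statistics_truncate_length(Some(128))
        .set_column_index_truncate_length(Some(128))
        .build();

    // `props` can then be passed to the writer, e.g.
    // ArrowWriter::try_new(file, schema, Some(props)).
    let _ = props;
}
```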
I also think it is unlikely that storing the full min/max values for large string columns yields significantly better pruning than truncated values would.
Describe the solution you'd like
I propose we set the default statistics truncate length to a non-`None` value to avoid these pathological cases.
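Concretely, the change might be as small as flipping a default constant in the parquet crate's writer properties. This is a sketch only, assuming the default is still expressed as a `DEFAULT_STATISTICS_TRUNCATE_LENGTH` constant in `parquet/src/file/properties.rs`; the value 128 matches the proposal below:

```rust
// Sketch of the proposed change in parquet/src/file/properties.rs
// (assumes the default is still expressed as this constant).

// Before: min/max statistics are written untruncated by default.
// pub const DEFAULT_STATISTICS_TRUNCATE_LENGTH: Option<usize> = None;

// After: truncate min/max statistics to 128 bytes by default.
pub const DEFAULT_STATISTICS_TRUNCATE_LENGTH: Option<usize> = Some(128);
```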
Describe alternatives you've considered
I would propose picking a value like 128, which is long enough to capture all primitive data types and "short" strings.
We can (and should) also document the default better.
Additional context
- related to: Files containing binary data with >=8_388_855 bytes per row written with `arrow-rs` can't be read with `pyarrow` (#7489)
- [C++][Python][Parquet] Files with very large data page header can't be read with `pyarrow` (arrow#46404)
- Files containing binary data with >=8_388_855 bytes per row written with `arro3` can't be read with `pyarrow` (kylebarron/arro3#324)