Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
By default, the arrow-rs parquet writer saves the entire actual min and max values for any column that has statistics enabled into the page metadata.

For large binary/string columns (think JSON blobs), this means two potentially large values (a min and a max) are stored in both the file-level metadata and in each page header.

This can lead to pathological cases such as those described in the issues linked under "Additional context" below.
It is possible to control the maximum size of these values using:

- `WriterPropertiesBuilder::set_statistics_truncate_length`
- `WriterPropertiesBuilder::set_column_index_truncate_length`

However, these values currently default to `None` (unlimited).
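For reference, here is a minimal sketch of opting into truncation today via these builder methods; the 128-byte limit is illustrative (it matches the value proposed below), not the current default:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Cap the min/max values written to statistics and to the column
    // index at 128 bytes each. 128 is an illustrative choice here;
    // today these settings default to unlimited.
    let props = WriterProperties::builder()
        .set_statistics_truncate_length(Some(128))
        .set_column_index_truncate_length(Some(128))
        .build();

    // `props` can then be passed to the writer, e.g.
    // ArrowWriter::try_new(file, schema, Some(props)).
    let _ = props;
}
```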
I also think it is unlikely that storing the full min/max values for large string columns yields significantly better pruning than truncated values would.
Describe the solution you'd like
I propose we set the default statistics truncate length to a non-`None` value to avoid these pathological cases.
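Concretely, the change might be as small as flipping a default constant in the parquet crate's writer properties. This is a sketch only, assuming the default is still expressed as a `DEFAULT_STATISTICS_TRUNCATE_LENGTH` constant in `parquet/src/file/properties.rs`; the value 128 matches the proposal below:

```rust
// Sketch of the proposed change in parquet/src/file/properties.rs
// (assumes the default is still expressed as this constant).

// Before: min/max statistics are written untruncated by default.
// pub const DEFAULT_STATISTICS_TRUNCATE_LENGTH: Option<usize> = None;

// After: truncate min/max statistics to 128 bytes by default.
pub const DEFAULT_STATISTICS_TRUNCATE_LENGTH: Option<usize> = Some(128);
```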
Describe alternatives you've considered
I would propose picking a value like 128, which is long enough to capture all primitive data types and "short" strings.
We can (and should) also document the default better.
Additional context
- related to: Files containing binary data with >=8_388_855 bytes per row written with `arrow-rs` can't be read with `pyarrow` (#7489)
- [C++][Python][Parquet] Files with very large data page header can't be read with `pyarrow` (arrow#46404)
- Files containing binary data with >=8_388_855 bytes per row written with `arro3` can't be read with `pyarrow` (kylebarron/arro3#324)