Consider a default max_statistics_truncate_length  #7490

Open
@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
By default, the arrow-rs parquet writer saves the entire actual min and max values for any column that has statistics enabled into the page metadata.

For large binary/string columns (think JSON blobs), this means that two potentially large values (a min and a max) are stored both in the file-level metadata and in each page header.

This can lead to pathological cases such as described in

It is possible to control the maximum size of the values using

  1. WriterPropertiesBuilder::set_statistics_truncate_length
  2. WriterPropertiesBuilder::set_column_index_truncate_length

However, the values currently default to None (unlimited).

I also think it is unlikely that the actual min/max values for large string columns will add significantly better pruning.
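
To illustrate why truncated values can still be used for pruning, here is a std-only sketch of one way to shorten bounds while keeping them valid. This is not the crate's implementation: the function names `truncate_min` and `truncate_max` are hypothetical, and the sketch works on raw bytes (a real UTF-8 column would also need to respect character boundaries).

```rust
/// Truncate a minimum value: any prefix sorts <= the full value,
/// so the prefix is still a valid lower bound.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Truncate a maximum value: take the prefix, then increment it so that it
/// sorts >= the full value. Returns None when no shorter upper bound exists
/// (every byte of the prefix is 0xFF, or `len` is 0).
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        // Already short enough; the exact value is kept.
        return Some(value.to_vec());
    }
    let mut prefix = value[..len].to_vec();
    // Drop trailing 0xFF bytes (they cannot be incremented),
    // then bump the last remaining byte.
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() += 1;
            return Some(prefix);
        }
    }
    None
}
```

The truncated bounds are looser than the exact ones, so a reader may fail to skip some data it could otherwise have pruned, but it never skips data incorrectly.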

Describe the solution you'd like
I propose we set the default statistics truncate length to a non-None value to avoid pathological cases.

Describe alternatives you've considered
I would propose picking a value like 128 that is long enough to capture all primitive data types and "short" strings.

We can (and should) also document the default better.

Additional context
