GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597
GH-3596: Add RowRanges.Builder for incremental construction from selected row indices#3597peter-toth wants to merge 5 commits into
RowRanges.Builder for incremental construction from selected row indices#3597Conversation
…m selected row indices ### Rationale for this change Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a RowRanges incrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time. ### What changes are included in this PR? Adds a Builder to RowRanges that takes a strictly-increasing sequence of selected row indices via addSelected(long) and coalesces consecutive indices into Range entries. Out-of-order or duplicate calls throw IllegalArgumentException. ### Are these changes tested? Yes. TestRowRanges covers single/multiple/coalesced ranges, the empty builder case, and the out-of-order/duplicate rejection paths. ### Are there any user-facing changes? No. Closes apache#3596 Co-authored-by: Matt Butrovich <mbutrovich@gmail.com>
e11f3d9 to
58857df
Compare
|
@wgtmac, could you please take a look at this PR? |
| * @return the constructed {@link RowRanges}, or {@link RowRanges#EMPTY} when no rows were | ||
| * selected. | ||
| */ | ||
| public RowRanges build() { |
There was a problem hiding this comment.
build() should return a snapshot or become terminal. As written, the returned RowRanges shares the builder’s mutable ranges list, so later calls on the same builder can mutate a previously built result and even break the sorted-ranges invariant.
| * @param blockRow the row index to mark selected (must be {@code >} the last value passed) | ||
| * @return this builder for chaining | ||
| */ | ||
| public Builder addSelected(long blockRow) { |
There was a problem hiding this comment.
addSelected should reject negative row indexes explicitly. -1 also collides with the builder’s sentinel state, so invalid input can be silently swallowed or corrupt the active run.
| @@ -316,4 +316,72 @@ public List<Range> getRanges() { | |||
| public String toString() { | |||
| return ranges.toString(); | |||
| } | |||
|
|
|||
| /** | |||
There was a problem hiding this comment.
This needs a clearer API commitment. The motivation is external readers/Spark, but the type lives under internal and the PR says there are no user-facing changes. If this is meant to be supported externally, the package/API contract should say so; otherwise downstream users may depend on an unsupported API.
There was a problem hiding this comment.
Yeah, at the time, it was yet another approach to tighten the amount of publicly available API in parquet-java. It quickly turned out that 3rd parties want to use it. For the next major release it is a good candidate to be moved to a more public package. Maybe deprecate it here, and also create at another location?
There was a problem hiding this comment.
@wgtmac, @gszadovszky, I wonder which package could be the right place for RowRanges and the new builder? Shall I try to move/deprecate them in this PR or separate one?
There was a problem hiding this comment.
A proper target package would be the same as now just skip the internal part. I am fine moving/deprecating in this PR. Theoretically, we can simply move it without deprecation since it is "internal" currently, so we have no guarantees for backward compatibility. WDYT, @wgtmac?
There was a problem hiding this comment.
Ok, moved in 8d8e2dd.
One thing to flag: the relocation changes a public signature ParquetFileReader.readFilteredRowGroup(int, RowRanges) now takes the new public RowRanges instead of the internal one. japicmp reports this as METHOD_REMOVED (and the inherited form on CompressionConverter$TransParquetFileReader), so I added excludes for:
org.apache.parquet.internal.filter2.columnindex.RowRanges(+$Range)org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter(return-type changes)org.apache.parquet.hadoop.ParquetFileReader#readFilteredRowGroup(int, …internal…RowRanges)(+ theTransParquetFileReaderinherited form)
There was a problem hiding this comment.
Oh, I haven't noticed that, sorry. Then, we probably need to get back to the deprecated/move pattern. 😞
Let's move to the new place, extend the class as required at the old place and deprecate. Also deprecate the public API that references the old place and create new methods for the new one. Mark 2.0 as the removal version. WDYT?
There was a problem hiding this comment.
I looked at the released (1.17.0) surface: the only genuinely-public, cross-package method exposing RowRanges is ParquetFileReader.readFilteredRowGroup(int, RowRanges). ParquetFileReader.getRowRanges(int), that returns RowRanges was only made public on master (1.18.0).
Everything else is under org.apache.parquet.internal.*.
Is it ok if we preserve compatibility for just readFilteredRowGroup(int, RowRanges) and treat the internal.* relocations as internal changes with no compat guarantee?
| * RowRanges ranges = builder.build(); | ||
| * }</pre> | ||
| */ | ||
| public static class Builder { |
There was a problem hiding this comment.
Consider making Builder final and adding an explicit constructor. Otherwise subclassing and new RowRanges.Builder() become part of the public API surface by accident.
| * @param blockRow the row index to mark selected (must be {@code >} the last value passed) | ||
| * @return this builder for chaining | ||
| */ | ||
| public Builder addSelected(long blockRow) { |
There was a problem hiding this comment.
The naming could be clearer before this API ships. addSelected(long blockRow) is ambiguous; something like addSelectedRow(long rowIndex) would better communicate that the value is a 0-based row index within the current row group.
|
Do you have any advice on this new API? @gszadovszky |
Rationale for this change
This PR is based on @mbutrovich's previous work.
Opening up APIs needed by a later materialization feature in Spark. External readers need to assemble a RowRanges incrementally from a stream of selected row indices (e.g. produced by a downstream filter or join) without having to know page boundaries ahead of time.
What changes are included in this PR?
Adds a Builder to RowRanges that takes a strictly-increasing sequence of selected row indices via addSelected(long) and coalesces consecutive indices into Range entries. Out-of-order or duplicate calls throw IllegalArgumentException.
Are these changes tested?
Yes. TestRowRanges covers single/multiple/coalesced ranges, the empty builder case, and the out-of-order/duplicate rejection paths.
Are there any user-facing changes?
No.
Closes #3596