
feat: implement last row cache reader for flat format #7757

Merged
evenyag merged 13 commits into GreptimeTeam:main from evenyag:feat/flat-last-row-cache
Mar 6, 2026

Conversation

@evenyag evenyag commented Mar 4, 2026

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

#7732

What's changed and what's your intention?

  • Add FlatRowGroupLastRowCachedReader and FlatLastTimestampSelector to support last-row selection
    and caching for the flat SST format, mirroring the existing primary-key format cache
  • Extend SelectorResultValue with a SelectorResult enum to store either primary-key Batch or
    flat RecordBatch cache entries
  • Wire the flat last-row reader into the file range read path, enabling it when the row group is
    PUT-only and all rows are selected

TSBS lastpoint without cache:

min:  2087.04ms, med:  2094.78ms, mean:  2111.82ms, max: 2264.96ms, stddev:    51.22ms, sum:  21.1sec, count: 10
all queries             :
min:  2087.04ms, med:  2094.78ms, mean:  2111.82ms, max: 2264.96ms, stddev:    51.22ms, sum:  21.1sec, count: 10

TSBS lastpoint with cache:

min:   385.46ms, med:   395.38ms, mean:   579.50ms, max: 2256.26ms, stddev:   558.91ms, sum:   5.8sec, count: 10
all queries             :
min:   385.46ms, med:   395.38ms, mean:   579.50ms, max: 2256.26ms, stddev:   558.91ms, sum:   5.8sec, count: 10
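
For a rough sense of scale, the speedups implied by the two runs above can be computed directly from the quoted median and mean latencies. The helper names below are hypothetical and exist only for this illustration. The cached run's max (2256.26ms) is close to the uncached latencies, which would be consistent with a first query that missed the cache; that reading is an inference and is not stated in the benchmark output.

```rust
/// Ratio of an uncached latency to a cached latency (both in milliseconds).
/// Hypothetical helper, not part of the PR.
fn speedup(without_cache_ms: f64, with_cache_ms: f64) -> f64 {
    without_cache_ms / with_cache_ms
}

/// Median and mean speedups from the TSBS lastpoint numbers quoted above.
fn lastpoint_speedups() -> (f64, f64) {
    let median = speedup(2094.78, 395.38); // roughly 5.3x
    let mean = speedup(2111.82, 579.50); // roughly 3.6x
    (median, mean)
}
```

The mean speedup is lower than the median one because the single slow query in the cached run pulls the cached mean up.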

Details

The existing last-row cache (RowGroupLastRowCachedReader) only works with the primary-key SST
format. This PR adds equivalent support for the flat format:

  • FlatLastTimestampSelector: Processes sorted flat RecordBatches, tracking primary key
    boundaries to emit only the row with the latest timestamp per series. Handles primary keys spanning
    multiple batches and deduplicates timestamps across batch boundaries.
  • FlatRowGroupLastRowCachedReader: Cache-aware reader that checks for cached flat results on
    hit, or reads through FlatRowGroupReader and populates the cache on miss.
  • SelectorResult enum: The cache value type now distinguishes between PrimaryKey(Vec<Batch>)
    and Flat(Vec<RecordBatch>) variants. Cache lookups validate both format and projection match before
    returning hits.
  • FileRange::flat_reader now accepts an optional TimeSeriesRowSelector and creates the
    last-row reader when applicable (PUT-only, all rows selected).
  • Primary key arrays are compacted during selection to reduce memory usage in cached results.
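
The selection step described above can be sketched on a simplified row model. This is not the code in this PR: `Row` and `LastRowSelector` are hypothetical stand-ins for RecordBatch rows and FlatLastTimestampSelector, rows are assumed to be sorted by (primary key, timestamp), and the sketch assumes that among rows sharing the latest timestamp the first one seen is kept.

```rust
/// One row of a simplified "flat" batch: encoded primary key, timestamp, one field.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    pk: Vec<u8>,
    ts: i64,
    value: f64,
}

/// Hypothetical stand-in for a last-row selector over sorted batches.
#[derive(Default)]
struct LastRowSelector {
    /// Candidate last row of the series currently being scanned.
    pending: Option<Row>,
    /// Rows already known to be the last row of a finished series.
    output: Vec<Row>,
}

impl LastRowSelector {
    /// Consumes one batch whose rows are sorted by (pk, ts).
    /// The candidate survives between calls, so a series may span batches.
    fn on_batch(&mut self, batch: &[Row]) {
        for row in batch {
            match self.pending.as_mut() {
                // Same series: replace the candidate only if this row is strictly newer,
                // so duplicate timestamps (even across batches) yield a single row.
                Some(p) if p.pk == row.pk => {
                    if row.ts > p.ts {
                        *p = row.clone();
                    }
                }
                // First row or a new series: flush the old candidate and start a new one.
                _ => {
                    if let Some(done) = self.pending.replace(row.clone()) {
                        self.output.push(done);
                    }
                }
            }
        }
    }

    /// Flushes the final series and returns one row per primary key.
    fn finish(mut self) -> Vec<Row> {
        if let Some(done) = self.pending.take() {
            self.output.push(done);
        }
        self.output
    }
}
```

Feeding the batches `[a@1, a@2]`, `[a@3, a@3, b@1]` and `[b@5]` through `on_batch` and then calling `finish` yields the first `a@3` row followed by `b@5`.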

Future Work

The ScanHint rule only passes the last-row hint when the query has tag filters only, so the last-row cache does not take field filters into account for now. A more flexible last-row cache can be supported in the future.
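
The constraint above can be pictured as a simple predicate over the columns a query filters on. This is a hypothetical sketch, not the ScanHint rule itself: `ColumnKind` and `can_pass_last_row_hint` are invented names, and treating a timestamp filter as disqualifying (and an empty filter list as acceptable) is a literal reading of "only tag filters" rather than confirmed behavior.

```rust
/// Kind of column a filter refers to (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnKind {
    Tag,
    Field,
    Timestamp,
}

/// Returns true when every filter targets a tag column.
/// Assumption: an empty filter list also qualifies.
fn can_pass_last_row_hint(filter_columns: &[ColumnKind]) -> bool {
    filter_columns.iter().all(|kind| *kind == ColumnKind::Tag)
}
```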

PR Checklist

Please convert it to a draft if some of the following conditions are not met.

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.
  • API changes are backward compatible.
  • Schema or data changes are backward compatible.

evenyag added 8 commits March 4, 2026 12:20
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
@github-actions github-actions bot added the size/M and docs-not-required labels Mar 4, 2026
@evenyag evenyag changed the title from "Feat/flat last row cache" to "feat: implement last row cache reader for flat format" Mar 4, 2026
Signed-off-by: evenyag <realevenyag@gmail.com>
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces crucial performance enhancements for data retrieval in the flat SST format by adding a dedicated last-row caching mechanism. It extends the existing caching infrastructure to support RecordBatches, ensuring that queries requesting only the latest data per time series can benefit from cached results, thereby reducing redundant computations and improving read efficiency for this specific data layout.

Highlights

  • Flat Format Last Row Caching: Implemented FlatRowGroupLastRowCachedReader and FlatLastTimestampSelector to provide last-row selection and caching capabilities for the flat SST format, mirroring the existing primary-key format's caching mechanism.
  • Extended Selector Result Cache: SelectorResultValue was extended with a new SelectorResult enum that stores cached entries in either the primary-key Batch format or the new flat RecordBatch format, allowing the cache to support both data representations.
  • Integration into Read Path: The new flat last-row reader is integrated into the file range read path, enabling its use when a row group contains only PUT operations and all rows are selected, optimizing queries for the latest data.
  • Primary Key Compaction: Primary key arrays are now compacted during selection to reduce memory usage in the cached results, specifically for the flat format.
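
The two-variant cache value and the hit check can be sketched as follows. The `Batch` and `RecordBatch` types here are empty stand-ins for the real ones, the field names and the `flat_hit` helper are assumptions made for illustration, and only the variant names `PrimaryKey` and `Flat` come from the PR description.

```rust
/// Stand-ins for the real batch types; only the variant shape matters here.
#[derive(Debug, Clone, PartialEq)]
struct Batch(Vec<i64>);
#[derive(Debug, Clone, PartialEq)]
struct RecordBatch(Vec<i64>);

/// Cached selector output in one of the two SST formats.
#[derive(Debug, Clone, PartialEq)]
enum SelectorResult {
    PrimaryKey(Vec<Batch>),
    Flat(Vec<RecordBatch>),
}

/// Cache value: the result plus the projection it was computed with.
/// Field names are assumptions for this sketch.
struct SelectorResultValue {
    result: SelectorResult,
    projection: Vec<usize>,
}

impl SelectorResultValue {
    /// Hypothetical helper: a flat reader treats the entry as a hit only if
    /// both the format and the projection match what it asked for.
    fn flat_hit(&self, projection: &[usize]) -> Option<&[RecordBatch]> {
        match &self.result {
            SelectorResult::Flat(batches) if self.projection.as_slice() == projection => {
                Some(batches.as_slice())
            }
            _ => None,
        }
    }
}
```

Any mismatch is handled like a miss, so a reader never receives batches in the wrong format or with the wrong columns.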

Changelog

  • src/mito2/src/cache.rs
    • Imported RecordBatch and record_batch_estimated_size for flat format support.
    • Introduced SelectorResult enum to differentiate between primary-key Batch and flat RecordBatch cache entries.
    • Updated SelectorResultValue to use SelectorResult and added a new_flat constructor.
    • Modified estimated_size to correctly calculate memory usage for both primary-key and flat result types.
  • src/mito2/src/read/last_row.rs
    • Added FlatRowGroupLastRowCachedReader to manage cached last rows for the flat format.
    • Implemented FlatLastRowCacheReader for iterating over cached flat results.
    • Created FlatRowGroupLastRowReader to read and cache last rows from flat format row groups.
    • Developed FlatLastTimestampSelector to process RecordBatches, identify primary key boundaries, and select the latest timestamped rows.
    • Added helper functions primary_key_bytes_at, last_timestamp_start, and rebuild_pk_dictionary_for_key for flat format primary key handling.
    • Updated RowGroupLastRowCachedReader to verify the result format (PrimaryKey) during cache hits.
    • Adjusted LastRowCacheReader to only process PrimaryKey results.
    • Added comprehensive unit tests for FlatLastTimestampSelector.
  • src/mito2/src/read/prune.rs
    • Imported FlatRowGroupLastRowCachedReader for integration.
    • Extended FlatSource enum to include LastRow variant for flat last-row cached readers.
    • Added new_with_last_row_reader constructor to FlatPruneReader for initializing with the new flat last-row reader.
    • Updated metrics method in FlatPruneReader to merge metrics from the inner FlatRowGroupLastRowCachedReader.
  • src/mito2/src/read/scan_util.rs
    • Modified flat_reader call to pass the series_row_selector from the stream context, enabling conditional last-row reading.
  • src/mito2/src/sst/parquet/file_range.rs
    • Modified flat_reader to accept an optional TimeSeriesRowSelector parameter.
    • Implemented logic to conditionally use FlatRowGroupLastRowCachedReader if the selector is LastRow, the row group is PUT-only, and all rows are selected.
    • Instantiated FlatPruneReader with either a standard FlatRowGroupReader or the new FlatRowGroupLastRowCachedReader based on the selection criteria.
  • src/mito2/src/sst/parquet/format.rs
    • Increased visibility of primary_key_offsets function to pub(crate) to allow its use by the new flat format components.
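
The reader selection described for file_range.rs boils down to a three-part condition. The sketch below models only that decision: `ReaderKind`, `RowGroupInfo` and `choose_reader` are hypothetical names, and `TimeSeriesRowSelector` is redeclared locally as a one-variant stand-in for the real type.

```rust
/// Local stand-in for the selector passed down from the scan.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeSeriesRowSelector {
    LastRow,
}

/// Which reader backs the prune reader (hypothetical model).
#[derive(Debug, PartialEq)]
enum ReaderKind {
    /// Plain row group reader.
    Plain,
    /// Cache-aware last-row reader.
    LastRowCached,
}

/// The two row group properties the condition depends on.
struct RowGroupInfo {
    put_only: bool,
    all_rows_selected: bool,
}

/// Uses the last-row reader only when the selector asks for the last row,
/// the row group contains only PUT operations, and no rows were pruned.
fn choose_reader(selector: Option<TimeSeriesRowSelector>, rg: &RowGroupInfo) -> ReaderKind {
    let wants_last_row = matches!(selector, Some(TimeSeriesRowSelector::LastRow));
    if wants_last_row && rg.put_only && rg.all_rows_selected {
        ReaderKind::LastRowCached
    } else {
        ReaderKind::Plain
    }
}
```

If any of the three conditions fails, the read falls back to the plain reader.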

Activity

  • The pull request introduces new functionality for flat format last-row caching and integrates it into the read path. No specific human activity (comments, reviews, approvals) is provided in the context.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a last-row cache reader for the flat SST format, which is a valuable performance optimization for a common query pattern. The changes are well-structured, with a new SelectorResult enum to handle different cache entry formats and a FlatLastTimestampSelector for the core logic. The integration into the existing read path appears correct. I have one suggestion to improve the clarity and robustness of the new reader implementation, as detailed in the comment.

@evenyag evenyag closed this Mar 4, 2026
evenyag added 3 commits March 4, 2026 20:19
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
- Replace VecDeque with BatchBuffer struct for output buffering
- Remove rebuild_pk_dictionary_for_key as batches go directly into buffer
- Remove unused push method and make BatchBuffer pub(crate)
- Remove debug logging in maybe_update_cache

Signed-off-by: evenyag <realevenyag@gmail.com>
@evenyag evenyag reopened this Mar 4, 2026
@evenyag evenyag marked this pull request as ready for review March 4, 2026 15:17
@evenyag evenyag requested review from v0y4g3r and waynexia as code owners March 4, 2026 15:17
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c86866ddc0

@v0y4g3r v0y4g3r left a comment

LGTM

Signed-off-by: evenyag <realevenyag@gmail.com>
@evenyag evenyag enabled auto-merge March 6, 2026 11:41
@evenyag evenyag added this pull request to the merge queue Mar 6, 2026
Merged via the queue into GreptimeTeam:main with commit 93c48a0 Mar 6, 2026
43 of 45 checks passed
@evenyag evenyag deleted the feat/flat-last-row-cache branch March 6, 2026 12:43
Labels

docs-not-required This change does not impact docs. size/M