feat: prune bulk memtable parts by first tag #7911
Code Review
This pull request introduces batch-level pruning for bulk memtables by extracting min/max statistics from the first tag of encoded primary keys. It implements a BatchStats structure and a PruningStatistics adapter to leverage DataFusion's pruning logic during scans. The review feedback identifies a logic error when using sparse primary key encoding, suggests caching statistics earlier in the write path to avoid redundant computations during every scan, and recommends lowering the log level for pruning events to reduce performance overhead and noise.
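The caching suggestion in the review can be illustrated with a minimal sketch: compute the first-tag min/max once when a part is built and reuse it on every scan. All names here (`FirstTagStats`, `Part`, `compute_stats`) are hypothetical and are not the types used in this PR; first-tag values are modeled as plain strings rather than encoded primary keys.

```rust
/// Hypothetical min/max of the first tag for one part.
/// Not the PR's `BatchStats`; this is only an illustration.
#[derive(Debug, Clone, PartialEq)]
struct FirstTagStats {
    min: String,
    max: String,
}

/// Hypothetical part that caches its statistics at build time,
/// so scans never have to recompute them.
struct Part {
    first_tags: Vec<String>,
    stats: Option<FirstTagStats>,
}

impl Part {
    /// Computes the statistics once, on the write path.
    fn new(first_tags: Vec<String>) -> Self {
        let stats = compute_stats(&first_tags);
        Self { first_tags, stats }
    }

    /// Scans read the cached value; `None` means "no statistics".
    fn stats(&self) -> Option<&FirstTagStats> {
        self.stats.as_ref()
    }

    /// Number of rows held by this part.
    fn num_rows(&self) -> usize {
        self.first_tags.len()
    }
}

/// Returns `None` for an empty input, otherwise the lexicographic
/// minimum and maximum of the first-tag values.
fn compute_stats(values: &[String]) -> Option<FirstTagStats> {
    let min = values.iter().min()?.clone();
    let max = values.iter().max()?.clone();
    Some(FirstTagStats { min, max })
}
```

With this shape, the cost of collecting statistics is paid once per part instead of once per scan, which is the trade-off the review feedback points at.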
9db208b to fbb038a
5ef57f8 to bc435a7
Signed-off-by: evenyag <realevenyag@gmail.com>
Document sparse encoding format in SparsePrimaryKeyCodec and add comment explaining why primary_key.first() works for both encodings. Remove noisy info-level pruning logs from the read path.
Signed-off-by: evenyag <realevenyag@gmail.com>
bc435a7 to 4db4fac
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4db4fac805
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4ee98a13d4
I hereby agree to the terms of the GreptimeDB CLA.
Refer to a related PR or issue link (optional)
What's changed and what's your intention?
Prune parts in the bulk memtable using the min/max statistics of the first tag. For now, only the first tag's statistics are used because they are very cheap to collect.
This reduces scan cost whenever the min/max statistics allow some parts to be skipped. On my dataset, it saved 20% of the scan time spent on the memtable.
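As a rough illustration of the idea, the sketch below decides which parts a scan still has to read, given each part's first-tag min/max and a simple predicate on that tag. It is a standalone sketch under stated assumptions: `PartStats`, `TagPredicate`, `may_match`, and `parts_to_scan` are hypothetical names, tag values are plain strings, and the PR itself delegates this decision to DataFusion's pruning logic rather than hand-written checks like these.

```rust
/// Hypothetical first-tag min/max for one part.
struct PartStats {
    min: String,
    max: String,
}

/// Hypothetical predicate on the first tag.
enum TagPredicate {
    Eq(String),
    Gt(String),
    Lt(String),
}

/// Returns false only when the statistics prove that no row in the
/// part can satisfy the predicate. Missing statistics keep the part.
fn may_match(stats: Option<&PartStats>, predicate: &TagPredicate) -> bool {
    let Some(stats) = stats else {
        // Without statistics nothing can be proven, so scan the part.
        return true;
    };
    match predicate {
        // The value must lie inside [min, max].
        TagPredicate::Eq(v) => stats.min.as_str() <= v.as_str() && v.as_str() <= stats.max.as_str(),
        // Some row may exceed v only if the maximum does.
        TagPredicate::Gt(v) => stats.max.as_str() > v.as_str(),
        // Some row may be below v only if the minimum is.
        TagPredicate::Lt(v) => stats.min.as_str() < v.as_str(),
    }
}

/// Indices of the parts that survive pruning and must be scanned.
fn parts_to_scan(parts: &[Option<PartStats>], predicate: &TagPredicate) -> Vec<usize> {
    parts
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_match(stats.as_ref(), predicate))
        .map(|(index, _)| index)
        .collect()
}
```

Pruning of this kind is conservative: a part is skipped only when its range rules out every row, so a part with wide or missing statistics is still scanned and results stay correct.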
PR Checklist
Please convert it to a draft if some of the following conditions are not met.