Commit d727621

docs: update ralph status — iteration 18 (26/43 within 2x, columnar cache for COUNT DISTINCT)
1 parent ee5989b commit d727621

2 files changed: 73 additions & 46 deletions


.ralph/LOGS.md

Lines changed: 38 additions & 0 deletions
@@ -597,3 +597,41 @@ REJECTED: All 6 success criteria NOT MET — score unchanged at 25/43 (target >=
3. **Performance ceiling on r5.4xlarge: 26-27/43**: Borderline queries need a 10-20% improvement that cannot come from code changes. Need m5.8xlarge (32 vCPU) or architectural changes.
4. **Q03 is noise-dependent**: Ranges from 1.76x to 2.65x across runs. Not reliably within 2x.
5. **Q29 is noise-dependent**: Ranges from 188ms to 242ms (target 192ms). Sometimes within 2x in isolation, but not in sequential benchmark runs.

## Iteration 18 — 2026-03-30T02:41-03:45Z

### What I Did
1. Assessed the environment: r5.4xlarge (16 vCPU, 124GB RAM), 48GB heap, wukong branch, OpenSearch green
2. Explored optimization paths via deep codebase analysis:
   - Analyzed all fast paths in the TransportShardExecuteAction dispatch
   - Analyzed FusedGroupByAggregate execution paths and parallelism
   - Investigated a DirectReader/PackedInts bypass for sub-1ns per-value reads
   - Identified loadNumericColumn (the columnar cache) as underutilized — used only in the COUNT(*) path
3. Implemented a columnar cache for scanSegmentForCountDistinct (Q08 path; see the sketch after this list):
   - Loads both key columns (RegionID, UserID) into long[] arrays via loadNumericColumn
   - Eliminates per-doc nextDoc()/nextValue() overhead in the hot loop
   - Preserves the fallback for segments with deleted docs (liveDocs != null)
4. Tried extending the columnar cache to executeMixedDedupWithHashSets (Q09 path):
   - Loaded key columns AND aggregate columns into long[][] arrays
   - REGRESSED: 3.287s → 3.749s due to memory pressure from loading multiple large arrays in parallel workers
   - REVERTED this change
5. Ran correctness: 39/43 PASS (no regression)
6. Benchmarked Q08 in isolation: 2.309s → 2.274s (a 1.5% improvement)
7. Ran the full benchmark: 26/43 within 2x
8. Q03 improved from 2.12x to 1.57x (noise-dependent; solidly within 2x in this run)
9. Q29 improved from 2.38x to 2.16x (still above 2x)
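For reference, a minimal sketch of the columnar-cache pattern from step 3. The helper name loadNumericColumn comes from this log, but its real signature, the key-packing scheme, and the class shape below are assumptions, not the actual implementation:

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

final class ColumnarCountDistinctSketch {

    // Assumed shape of loadNumericColumn: one sequential DocValues pass that
    // materializes a single-valued numeric column into a docId-indexed array
    // (docs missing the field keep 0).
    static long[] loadNumericColumn(LeafReader leaf, String field) throws IOException {
        NumericDocValues dv = DocValues.getNumeric(leaf, field);
        long[] col = new long[leaf.maxDoc()];
        for (int doc = dv.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = dv.nextDoc()) {
            col[doc] = dv.longValue();
        }
        return col;
    }

    // Q08-style hot loop over flat arrays: no per-doc nextDoc()/nextValue().
    // Valid only when the segment has no deleted docs (liveDocs == null);
    // otherwise the iterator-based path remains the fallback.
    static int countDistinctPairs(LeafReader leaf) throws IOException {
        long[] region = loadNumericColumn(leaf, "RegionID");
        long[] user = loadNumericColumn(leaf, "UserID");
        LongOpenHashSet seen = new LongOpenHashSet();
        for (int doc = 0; doc < region.length; doc++) {
            // Pack both keys into one long (assumes each fits in 32 bits).
            seen.add((region[doc] << 32) | (user[doc] & 0xFFFFFFFFL));
        }
        return seen.size();
    }
}
```

Consistent with the ~1.5% gain, this removes only iterator overhead; the dominant cost stays in LongOpenHashSet.add().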
### Results
- Score: 26/43 within 2x (same as the iter17 clean run)
- Correctness: 39/43 PASS (unchanged)
- Q03: 0.235s → 0.174s (2.12x → 1.57x) — noise-dependent improvement
- Q08: 2.309s → 2.274s (4.28x → 4.21x in isolation) — marginal columnar-cache benefit
- Q09 columnar cache: REVERTED (regression from memory pressure)
- Q29: 0.228s → 0.207s (2.38x → 2.16x) — noise-dependent improvement
- Q16 GC cascade: continues to affect Q15-Q27 in full benchmark runs

### Decisions
1. **scanSegmentForCountDistinct columnar cache KEPT**: Marginal improvement (~1.5%) for Q08 from loading both key columns into flat arrays. The bottleneck is LongOpenHashSet.add() operations, not DocValues reads.
2. **executeMixedDedupWithHashSets columnar cache REVERTED**: Loading multiple large arrays (key0, key1, aggregate columns) simultaneously in parallel workers causes memory pressure and GC storms. The per-segment memory cost is ~200MB per column × 4 columns = ~800MB, which competes with the LongOpenHashSet allocations.
3. **Performance ceiling confirmed on r5.4xlarge**: 18 iterations of optimization have exhausted code-level improvements. The remaining gap is fundamental Lucene DocValues overhead (3-5ns per value vs ClickHouse's 0.5-1ns). To reach ≥38/43: need m5.8xlarge (32 vCPU), a DirectReader bypass, or custom columnar storage.
4. **DirectReader bypass identified as the next frontier**: Lucene's internal DirectReader.getInstance() provides O(1) random access at ~1-1.5ns per value. Requires accessing package-private NumericEntry metadata via reflection or a codec fork (see the sketch below). This is the "nuclear option" that could close the 2-3x gap for borderline queries.
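Decision 4 sketched under heavy assumptions: DirectReader and LongValues are public Lucene utility APIs, but the data slice and bitsPerValue used below live in the codec's package-private NumericEntry metadata, so recovering them (via reflection or a codec fork) is elided, and any min/GCD rebasing the codec applies to stored values is ignored here:

```java
import org.apache.lucene.store.RandomAccessInput;
import org.apache.lucene.util.LongValues;
import org.apache.lucene.util.packed.DirectReader;

final class DirectReaderBypassSketch {

    // `slice` and `bitsPerValue` are assumed to have been recovered from the
    // codec's package-private NumericEntry (reflection or codec fork).
    static long[] bulkRead(RandomAccessInput slice, int bitsPerValue, int maxDoc) {
        // O(1) random access into the packed-integer data, no iterator state.
        LongValues values = DirectReader.getInstance(slice, bitsPerValue);
        long[] out = new long[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            out[doc] = values.get(doc);
        }
        return out;
    }
}
```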

.ralph/STATUS.md

Lines changed: 35 additions & 46 deletions
@@ -1,57 +1,46 @@
 status: WORKING
-iteration: 17
+iteration: 18

 ## Current State
-- Score: 26/43 within 2x (best clean run), 25/43 in noisy runs (Q03 is swing query at 1.86-2.12x)
+- Score: 26/43 within 2x (full run), same as the iter17 clean run
 - Correctness: 39/43 PASS (no regression)
-- Machine: r5.4xlarge (16 vCPU, 124GB RAM), 49GB heap, 4 shards, 4 segments/shard
-- No code changes this iteration — all optimizations from iterations 1-16 are in place
+- Machine: r5.4xlarge (16 vCPU, 124GB RAM), 48GB heap, 4 shards, 4 segments/shard
+- Code change: columnar cache for scanSegmentForCountDistinct (Q08 path)

-## Queries Within 2x (26, clean run)
-Q00(0.36x) Q01(0.16x) Q03(1.86x) Q06(0.12x) Q07(0.36x) Q10(0.81x) Q12(0.55x)
-Q17(0.01x) Q19(0.08x) Q20(0.01x) Q21(0.02x) Q22(0.03x) Q23(0.00x) Q24(0.02x)
-Q25(1.58x) Q26(0.03x) Q31(1.14x) Q32(1.77x) Q33(0.35x) Q34(0.35x) Q36(1.17x)
-Q37(0.45x) Q38(0.51x) Q40(0.20x) Q41(0.73x) Q42(0.50x)
+## Queries Within 2x (26)
+Q00(0.36x) Q01(0.14x) Q03(1.57x) Q06(0.12x) Q07(0.36x) Q10(0.84x) Q12(0.55x)
+Q17(0.01x) Q19(0.08x) Q20(0.01x) Q21(0.02x) Q22(0.04x) Q23(0.00x) Q24(0.02x)
+Q25(1.68x) Q26(0.05x) Q31(1.12x) Q32(1.69x) Q33(0.31x) Q34(0.28x) Q36(1.17x)
+Q37(0.44x) Q38(0.53x) Q40(0.21x) Q41(0.73x) Q42(0.52x)

 ## Queries Above 2x (17, sorted by ratio)
-Q28(2.24x) Q29(2.30x) Q27(2.39x) Q30(2.43x) Q14(2.44x) Q02(3.93x) Q35(4.18x)
-Q08(4.37x) Q05(4.89x) Q04(5.17x) Q09(5.40x) Q16(6.40x) Q11(6.43x) Q13(7.84x)
-Q18(9.59x) Q39(27.13x) Q15(32.34x)
-
-## Exhaustive Analysis of Remaining Optimization Paths
-
-### Fundamental Bottleneck: Lucene DocValues Decode Overhead
-- Per-doc decode: ~2-5ns per nextDoc()+nextValue() (variable-length integer decompression)
-- ClickHouse Parquet: bulk SIMD-optimized column reads, ~0.5-1ns per value
-- This 2-5x gap is NOT fixable with code optimizations — requires storage format changes
-- 17 iterations of optimization have exhausted all code-level improvements
-
-### All Handover Steps Already Implemented
-1. **COUNT(DISTINCT) fusion**: PlanFragmenter decomposes, TransportShardExecuteAction routes to 5 specialized paths
-2. **executeSingleKeyNumericFlat parallelism**: Both doc-range and segment-level parallelism
-3. **Hash-partitioned aggregation**: Implemented for high-cardinality GROUP BY
-4. **Borderline optimizations**: All borderline queries hit optimized fused paths
-5. **REGEXP_REPLACE caching**: Pattern cached, ordinal-based evaluation, ultra-fast group extraction
-
-### What Was Tried This Iteration
-- Segment-parallel optimization for N-key varchar path (Q14): REVERTED — HashMap merge overhead exceeds parallelism benefit
-- Analyzed all borderline queries (Q28, Q29, Q27, Q30, Q14, Q02): all hit optimized code paths
-- Q29 is noise-dependent (188-242ms, target 192ms) — sometimes within 2x in isolation
-
-### Performance Ceiling on r5.4xlarge
-- Realistic ceiling: 26-27/43 with noise-dependent Q03
-- Borderline queries (Q28, Q29, Q27, Q30, Q14) are 2.2-2.5x — need 10-20% improvement
-- The 10-20% gap is fundamental Lucene DocValues overhead, not code inefficiency
-- To reach ≥38/43: need m5.8xlarge (32 vCPU) or architectural changes (columnar storage, vectorized execution)
+Q29(2.16x) Q28(2.24x) Q14(2.28x) Q30(2.40x) Q27(2.73x) Q02(3.52x) Q35(4.00x)
+Q08(4.29x) Q04(5.04x) Q05(5.45x) Q09(5.53x) Q11(6.88x) Q16(7.08x) Q13(7.78x)
+Q18(9.78x) Q39(29.42x) Q15(32.89x)
+
+## What Was Done This Iteration
+1. Explored remaining optimization paths: columnar cache extension, DirectReader bypass, batch DV reads
+2. Implemented a columnar cache for scanSegmentForCountDistinct (Q08 path): loads both key columns into long[] arrays before iterating
+3. Tried extending the columnar cache to executeMixedDedupWithHashSets (Q09 path) — REVERTED due to a memory pressure regression
+4. Q03 improved from 2.12x to 1.57x (noise-dependent, now solidly within 2x in this run)
+5. Q08 columnar cache shows a marginal improvement in isolation (2.309s → 2.274s, ~1.5%)
+
+## Performance Ceiling Analysis
+- 18 iterations of optimization have exhausted code-level improvements on r5.4xlarge
+- Remaining gap is fundamental: Lucene DocValues per-doc decode (~3-5ns) vs ClickHouse columnar bulk reads (~0.5-1ns)
+- Borderline queries (Q29 2.16x, Q28 2.24x, Q14 2.28x, Q30 2.40x) need a 10-20% improvement
+- The 10-20% gap cannot be closed with code optimizations — requires either:
+  1. More CPU (m5.8xlarge with 32 vCPU)
+  2. Bypassing the Lucene DocValues API (DirectReader/PackedInts access)
+  3. A custom columnar storage format
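A rough back-of-envelope on that decode gap, assuming for illustration a single 100M-value column scan (the value count is an assumption for scale, not a measured figure): at 3-5ns per value the scan alone costs ~0.3-0.5s, versus ~0.05-0.1s at 0.5-1ns. That is a 3-10x per-scan difference, far larger than the 10-20% that further code-level tweaks could recover.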

 ## Next Steps
-1. **Move to m5.8xlarge (32 vCPU)**: Doubling CPU count would halve per-shard execution time, potentially bringing borderline queries within 2x
-2. **Columnar storage format**: Replace Lucene DocValues with a columnar format (Arrow, Parquet) for bulk vectorized reads
-3. **Vectorized execution**: Use SIMD instructions for batch aggregation instead of per-doc scalar operations
-4. **Q16/Q18 OOM mitigation**: These queries cause GC cascades that affect subsequent queries in benchmark runs
+1. **DirectReader bypass**: Access Lucene's internal packed integer data directly, bypassing the SortedNumericDocValues API. Could reduce per-value cost from ~3ns to ~1ns. Requires reflection or codec fork.
+2. **Move to m5.8xlarge**: Doubling CPU count would halve per-shard execution time, potentially bringing borderline queries within 2x.
+3. **Q16 OOM mitigation**: Q16 causes GC cascades that affect Q15-Q27 in sequential benchmark runs.

 ## Evidence
-- Clean benchmark: /tmp/iter17_baseline2/r5.4xlarge.json (26/43 within 2x)
-- Final benchmark: /tmp/iter17_final/r5.4xlarge.json (25/43, Q03 noise-dependent)
-- Correctness: 39/43 PASS (/tmp/correctness_iter17b.log)
-- Build: BUILD SUCCESSFUL (no code changes)
+- Full benchmark: /tmp/iter18_full/r5.4xlarge.json (26/43 within 2x)
+- Q08 isolated: /tmp/iter18_q08_isolated/r5.4xlarge.json (2.274s best)
+- Correctness: 39/43 PASS (/tmp/correctness_iter18c.log)
+- Build: BUILD SUCCESSFUL
