docs: update ralph status — iteration 18 (26/43 within 2x, columnar cache for COUNT DISTINCT)

penghuo · penghuo · commit d727621c5f4f · 2026-03-30T03:46:00.000Z
diff --git a/.ralph/LOGS.md b/.ralph/LOGS.md
@@ -597,3 +597,41 @@ REJECTED: All 6 success criteria NOT MET — score unchanged at 25/43 (target >=
 3. **Performance ceiling on r5.4xlarge: 26-27/43**: Borderline queries need 10-20% improvement that cannot come from code changes. Need m5.8xlarge (32 vCPU) or architectural changes.
 4. **Q03 is noise-dependent**: Ranges from 1.76x to 2.65x across runs. Not reliably within 2x.
 5. **Q29 is noise-dependent**: Ranges from 188ms to 242ms (target 192ms). Sometimes within 2x in isolation but not in sequential benchmark runs.
+
+## Iteration 18 — 2026-03-30T02:41-03:45Z
+
+### What I Did
+1. Assessed environment: r5.4xlarge (16 vCPU, 124GB RAM), 48GB heap, wukong branch, OpenSearch green
+2. Explored optimization paths via deep codebase analysis:
+   - Analyzed all fast paths in TransportShardExecuteAction dispatch
+   - Analyzed FusedGroupByAggregate execution paths and parallelism
+   - Investigated DirectReader/PackedInts bypass for sub-1ns per-value reads
+   - Identified loadNumericColumn (columnar cache) as underutilized — only used in COUNT(*) path
+3. Implemented columnar cache for scanSegmentForCountDistinct (Q08 path):
+   - Loads both key columns (RegionID, UserID) into long[] arrays via loadNumericColumn
+   - Eliminates per-doc nextDoc()/nextValue() overhead in the hot loop
+   - Preserves fallback for segments with deleted docs (liveDocs != null)
+4. Tried extending columnar cache to executeMixedDedupWithHashSets (Q09 path):
+   - Loaded key columns AND aggregate columns into long[][] arrays
+   - REGRESSED: 3.287s → 3.749s (~14% slower) due to memory pressure from loading multiple large arrays in parallel workers
+   - REVERTED this change
+5. Ran correctness: 39/43 PASS (no regression)
+6. Benchmarked Q08 in isolation: 2.309s → 2.274s (1.5% improvement)
+7. Ran full benchmark: 26/43 within 2x
+8. Q03 improved from 2.12x to 1.57x (noise-dependent, solidly within 2x in this run)
+9. Q29 improved from 2.38x to 2.16x (still above 2x)
+
+### Results
+- Score: 26/43 within 2x (same as iter17 clean run)
+- Correctness: 39/43 PASS (unchanged)
+- Q03: 0.235s → 0.174s (2.12x → 1.57x) — noise-dependent improvement
+- Q08: 2.309s → 2.274s (4.28x → 4.21x in isolation) — marginal columnar cache benefit
+- Q09 columnar cache: REVERTED (regression from memory pressure)
+- Q29: 0.228s → 0.207s (2.38x → 2.16x) — noise-dependent improvement
+- Q16 GC cascade: continues to affect Q15-Q27 in full benchmark runs
+
+### Decisions
+1. **scanSegmentForCountDistinct columnar cache KEPT**: Marginal improvement (~1.5%) for Q08 by loading both key columns into flat arrays. The bottleneck is LongOpenHashSet.add() operations, not DocValues reads.
+2. **executeMixedDedupWithHashSets columnar cache REVERTED**: Loading multiple large arrays (key0, key1, agg columns) simultaneously in parallel workers causes memory pressure and GC storms. The per-segment memory cost is ~200MB per column × 4 columns = ~800MB, which competes with the LongOpenHashSet allocations.
+3. **Performance ceiling confirmed on r5.4xlarge**: 18 iterations of optimization have exhausted code-level improvements. The remaining gap is fundamental Lucene DocValues overhead (3-5ns per value vs ClickHouse's 0.5-1ns). To reach ≥38/43: need m5.8xlarge (32 vCPU), DirectReader bypass, or custom columnar storage.
+4. **DirectReader bypass identified as next frontier**: Lucene's internal DirectReader.getInstance() provides O(1) random access at ~1-1.5ns per value. Requires accessing package-private NumericEntry metadata via reflection or a codec fork. This is the "nuclear option" that could close the 2-3x per-value read gap (3-5ns vs ~1-1.5ns) for borderline queries.
diff --git a/.ralph/STATUS.md b/.ralph/STATUS.md
@@ -1,57 +1,46 @@
 status: WORKING
-iteration: 17
+iteration: 18
 
 ## Current State
-- Score: 26/43 within 2x (best clean run), 25/43 in noisy runs (Q03 is swing query at 1.86-2.12x)
+- Score: 26/43 within 2x (full run), same as iter17 clean run
 - Correctness: 39/43 PASS (no regression)
-- Machine: r5.4xlarge (16 vCPU, 124GB RAM), 49GB heap, 4 shards, 4 segments/shard
-- No code changes this iteration — all optimizations from iterations 1-16 are in place
+- Machine: r5.4xlarge (16 vCPU, 124GB RAM), 48GB heap, 4 shards, 4 segments/shard
+- Code change: columnar cache for scanSegmentForCountDistinct (Q08 path)
 
-## Queries Within 2x (26, clean run)
-Q00(0.36x) Q01(0.16x) Q03(1.86x) Q06(0.12x) Q07(0.36x) Q10(0.81x) Q12(0.55x)
-Q17(0.01x) Q19(0.08x) Q20(0.01x) Q21(0.02x) Q22(0.03x) Q23(0.00x) Q24(0.02x)
-Q25(1.58x) Q26(0.03x) Q31(1.14x) Q32(1.77x) Q33(0.35x) Q34(0.35x) Q36(1.17x)
-Q37(0.45x) Q38(0.51x) Q40(0.20x) Q41(0.73x) Q42(0.50x)
+## Queries Within 2x (26)
+Q00(0.36x) Q01(0.14x) Q03(1.57x) Q06(0.12x) Q07(0.36x) Q10(0.84x) Q12(0.55x)
+Q17(0.01x) Q19(0.08x) Q20(0.01x) Q21(0.02x) Q22(0.04x) Q23(0.00x) Q24(0.02x)
+Q25(1.68x) Q26(0.05x) Q31(1.12x) Q32(1.69x) Q33(0.31x) Q34(0.28x) Q36(1.17x)
+Q37(0.44x) Q38(0.53x) Q40(0.21x) Q41(0.73x) Q42(0.52x)
 
 ## Queries Above 2x (17, sorted by ratio)
-Q28(2.24x) Q29(2.30x) Q27(2.39x) Q30(2.43x) Q14(2.44x) Q02(3.93x) Q35(4.18x)
-Q08(4.37x) Q05(4.89x) Q04(5.17x) Q09(5.40x) Q16(6.40x) Q11(6.43x) Q13(7.84x)
-Q18(9.59x) Q39(27.13x) Q15(32.34x)
-
-## Exhaustive Analysis of Remaining Optimization Paths
-
-### Fundamental Bottleneck: Lucene DocValues Decode Overhead
-- Per-doc decode: ~2-5ns per nextDoc()+nextValue() (variable-length integer decompression)
-- ClickHouse Parquet: bulk SIMD-optimized column reads, ~0.5-1ns per value
-- This 2-5x gap is NOT fixable with code optimizations — requires storage format changes
-- 17 iterations of optimization have exhausted all code-level improvements
-
-### All Handover Steps Already Implemented
-1. **COUNT(DISTINCT) fusion**: PlanFragmenter decomposes, TransportShardExecuteAction routes to 5 specialized paths
-2. **executeSingleKeyNumericFlat parallelism**: Both doc-range and segment-level parallelism
-3. **Hash-partitioned aggregation**: Implemented for high-cardinality GROUP BY
-4. **Borderline optimizations**: All borderline queries hit optimized fused paths
-5. **REGEXP_REPLACE caching**: Pattern cached, ordinal-based evaluation, ultra-fast group extraction
-
-### What Was Tried This Iteration
-- Segment-parallel optimization for N-key varchar path (Q14): REVERTED — HashMap merge overhead exceeds parallelism benefit
-- Analyzed all borderline queries (Q28, Q29, Q27, Q30, Q14, Q02): all hit optimized code paths
-- Q29 is noise-dependent (188-242ms, target 192ms) — sometimes within 2x in isolation
-
-### Performance Ceiling on r5.4xlarge
-- Realistic ceiling: 26-27/43 with noise-dependent Q03
-- Borderline queries (Q28, Q29, Q27, Q30, Q14) are 2.2-2.5x — need 10-20% improvement
-- The 10-20% gap is fundamental Lucene DocValues overhead, not code inefficiency
-- To reach ≥38/43: need m5.8xlarge (32 vCPU) or architectural changes (columnar storage, vectorized execution)
+Q29(2.16x) Q28(2.24x) Q14(2.28x) Q30(2.40x) Q27(2.73x) Q02(3.52x) Q35(4.00x)
+Q08(4.29x) Q04(5.04x) Q05(5.45x) Q09(5.53x) Q11(6.88x) Q16(7.08x) Q13(7.78x)
+Q18(9.78x) Q39(29.42x) Q15(32.89x)
+
+## What Was Done This Iteration
+1. Explored previously untried optimization paths: columnar cache extension, DirectReader bypass, batch DV reads
+2. Implemented columnar cache for scanSegmentForCountDistinct (Q08 path): loads both key columns into long[] arrays before iterating
+3. Tried extending columnar cache to executeMixedDedupWithHashSets (Q09 path) — REVERTED due to memory pressure regression
+4. Q03 improved from 2.12x to 1.57x (noise-dependent, now solidly within 2x in this run)
+5. Q08 columnar cache shows marginal improvement in isolation (2.309s → 2.274s, ~1.5%)
+
+## Performance Ceiling Analysis
+- 18 iterations of optimization have exhausted code-level improvements on r5.4xlarge
+- Remaining gap is fundamental: Lucene DocValues per-doc decode (~3-5ns) vs ClickHouse columnar bulk reads (~0.5-1ns)
+- Borderline queries (Q29 2.16x, Q28 2.24x, Q14 2.28x, Q30 2.40x) need 10-20% improvement
+- The 10-20% gap cannot be closed with code optimizations; it requires one of:
+  1. More CPU (m5.8xlarge with 32 vCPU)
+  2. Bypassing Lucene DocValues API (DirectReader/PackedInts access)
+  3. Custom columnar storage format
 
 ## Next Steps
-1. **Move to m5.8xlarge (32 vCPU)**: Doubling CPU count would halve per-shard execution time, potentially bringing borderline queries within 2x
-2. **Columnar storage format**: Replace Lucene DocValues with a columnar format (Arrow, Parquet) for bulk vectorized reads
-3. **Vectorized execution**: Use SIMD instructions for batch aggregation instead of per-doc scalar operations
-4. **Q16/Q18 OOM mitigation**: These queries cause GC cascades that affect subsequent queries in benchmark runs
+1. **DirectReader bypass**: Access Lucene's internal packed integer data directly, bypassing the SortedNumericDocValues API. Could reduce per-value cost from ~3-5ns to ~1-1.5ns. Requires reflection or a codec fork.
+2. **Move to m5.8xlarge**: Doubling the vCPU count (16 → 32) could roughly halve per-shard execution time if the work scales near-linearly, potentially bringing borderline queries within 2x.
+3. **Q16 OOM mitigation**: Q16 causes GC cascades that affect Q15-Q27 in sequential benchmark runs.
 
 ## Evidence
-- Clean benchmark: /tmp/iter17_baseline2/r5.4xlarge.json (26/43 within 2x)
-- Final benchmark: /tmp/iter17_final/r5.4xlarge.json (25/43, Q03 noise-dependent)
-- Correctness: 39/43 PASS (/tmp/correctness_iter17b.log)
-- Build: BUILD SUCCESSFUL (no code changes)
+- Full benchmark: /tmp/iter18_full/r5.4xlarge.json (26/43 within 2x)
+- Q08 isolated: /tmp/iter18_q08_isolated/r5.4xlarge.json (2.274s best)
+- Correctness: 39/43 PASS (/tmp/correctness_iter18c.log)
+- Build: BUILD SUCCESSFUL
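
The columnar-cache change to scanSegmentForCountDistinct described for iteration 18 can be illustrated with a minimal, dependency-free sketch. This is not the code from the commit. `DocValueIterator`, `iteratorOver`, `loadColumn`, and `scanSegment` are hypothetical stand-ins, not Lucene or project APIs. The sketch also assumes dense columns (exactly one value per document) and groups distinct UserID values per RegionID purely for illustration. It shows the two paths the log mentions: flat `long[]` arrays when `liveDocs == null`, and a per-document iterator fallback when the segment has deleted documents.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ColumnarCountDistinctSketch {

    /** Hypothetical stand-in for a per-document doc-values iterator (not the Lucene API). */
    public interface DocValueIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
        long longValue();
    }

    /** Array-backed iterator so the sketch runs without an index; assumes one value per doc. */
    public static DocValueIterator iteratorOver(long[] values) {
        return new DocValueIterator() {
            private int doc = -1;

            @Override
            public int nextDoc() {
                doc++;
                return doc < values.length ? doc : NO_MORE_DOCS;
            }

            @Override
            public long longValue() {
                return values[doc];
            }
        };
    }

    /** Drains an iterator once into a flat array indexed by doc id (the "columnar cache"). */
    public static long[] loadColumn(DocValueIterator it, int maxDoc) {
        long[] column = new long[maxDoc];
        for (int doc = it.nextDoc(); doc != DocValueIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            column[doc] = it.longValue();
        }
        return column;
    }

    /** Collects the distinct user ids seen per region id within one segment. */
    public static Map<Long, Set<Long>> scanSegment(
            DocValueIterator regionIt, DocValueIterator userIt, int maxDoc, boolean[] liveDocs) {
        Map<Long, Set<Long>> groups = new HashMap<>();
        if (liveDocs == null) {
            // Fast path: no deletions, so the hot loop touches only flat arrays.
            long[] regions = loadColumn(regionIt, maxDoc);
            long[] users = loadColumn(userIt, maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                groups.computeIfAbsent(regions[doc], k -> new HashSet<>()).add(users[doc]);
            }
            return groups;
        }
        // Fallback: advance both iterators in lockstep and skip deleted docs.
        for (int doc = regionIt.nextDoc(); doc != DocValueIterator.NO_MORE_DOCS; doc = regionIt.nextDoc()) {
            userIt.nextDoc();
            if (liveDocs[doc]) {
                groups.computeIfAbsent(regionIt.longValue(), k -> new HashSet<>()).add(userIt.longValue());
            }
        }
        return groups;
    }
}
```

In this sketch both paths still pay one hash-set insertion per document. That matches the log's observation that the flat arrays gave only about a 1.5% gain on Q08, because set insertion rather than value decoding dominated the loop.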