|
1 | 1 | status: WORKING |
2 | | -iteration: 17 |
| 2 | +iteration: 18 |
3 | 3 |
|
4 | 4 | ## Current State |
5 | | -- Score: 26/43 within 2x (best clean run), 25/43 in noisy runs (Q03 is swing query at 1.86-2.12x) |
| 5 | +- Score: 26/43 within 2x (full run), same as iter17 clean run |
6 | 6 | - Correctness: 39/43 PASS (no regression) |
7 | | -- Machine: r5.4xlarge (16 vCPU, 124GB RAM), 49GB heap, 4 shards, 4 segments/shard |
8 | | -- No code changes this iteration — all optimizations from iterations 1-16 are in place |
| 7 | +- Machine: r5.4xlarge (16 vCPU, 124GB RAM), 48GB heap, 4 shards, 4 segments/shard |
| 8 | +- Code change: columnar cache for scanSegmentForCountDistinct (Q08 path) |
9 | 9 |
|
10 | | -## Queries Within 2x (26, clean run) |
11 | | -Q00(0.36x) Q01(0.16x) Q03(1.86x) Q06(0.12x) Q07(0.36x) Q10(0.81x) Q12(0.55x) |
12 | | -Q17(0.01x) Q19(0.08x) Q20(0.01x) Q21(0.02x) Q22(0.03x) Q23(0.00x) Q24(0.02x) |
13 | | -Q25(1.58x) Q26(0.03x) Q31(1.14x) Q32(1.77x) Q33(0.35x) Q34(0.35x) Q36(1.17x) |
14 | | -Q37(0.45x) Q38(0.51x) Q40(0.20x) Q41(0.73x) Q42(0.50x) |
| 10 | +## Queries Within 2x (26) |
| 11 | +Q00(0.36x) Q01(0.14x) Q03(1.57x) Q06(0.12x) Q07(0.36x) Q10(0.84x) Q12(0.55x) |
| 12 | +Q17(0.01x) Q19(0.08x) Q20(0.01x) Q21(0.02x) Q22(0.04x) Q23(0.00x) Q24(0.02x) |
| 13 | +Q25(1.68x) Q26(0.05x) Q31(1.12x) Q32(1.69x) Q33(0.31x) Q34(0.28x) Q36(1.17x) |
| 14 | +Q37(0.44x) Q38(0.53x) Q40(0.21x) Q41(0.73x) Q42(0.52x) |
15 | 15 |
|
16 | 16 | ## Queries Above 2x (17, sorted by ratio) |
17 | | -Q28(2.24x) Q29(2.30x) Q27(2.39x) Q30(2.43x) Q14(2.44x) Q02(3.93x) Q35(4.18x) |
18 | | -Q08(4.37x) Q05(4.89x) Q04(5.17x) Q09(5.40x) Q16(6.40x) Q11(6.43x) Q13(7.84x) |
19 | | -Q18(9.59x) Q39(27.13x) Q15(32.34x) |
20 | | - |
21 | | -## Exhaustive Analysis of Remaining Optimization Paths |
22 | | - |
23 | | -### Fundamental Bottleneck: Lucene DocValues Decode Overhead |
24 | | -- Per-doc decode: ~2-5ns per nextDoc()+nextValue() (variable-length integer decompression) |
25 | | -- ClickHouse Parquet: bulk SIMD-optimized column reads, ~0.5-1ns per value |
26 | | -- This 2-5x gap is NOT fixable with code optimizations — requires storage format changes |
27 | | -- 17 iterations of optimization have exhausted all code-level improvements |
28 | | - |
29 | | -### All Handover Steps Already Implemented |
30 | | -1. **COUNT(DISTINCT) fusion**: PlanFragmenter decomposes, TransportShardExecuteAction routes to 5 specialized paths |
31 | | -2. **executeSingleKeyNumericFlat parallelism**: Both doc-range and segment-level parallelism |
32 | | -3. **Hash-partitioned aggregation**: Implemented for high-cardinality GROUP BY |
33 | | -4. **Borderline optimizations**: All borderline queries hit optimized fused paths |
34 | | -5. **REGEXP_REPLACE caching**: Pattern cached, ordinal-based evaluation, ultra-fast group extraction |
35 | | - |
36 | | -### What Was Tried This Iteration |
37 | | -- Segment-parallel optimization for N-key varchar path (Q14): REVERTED — HashMap merge overhead exceeds parallelism benefit |
38 | | -- Analyzed all borderline queries (Q28, Q29, Q27, Q30, Q14, Q02): all hit optimized code paths |
39 | | -- Q29 is noise-dependent (188-242ms, target 192ms) — sometimes within 2x in isolation |
40 | | - |
41 | | -### Performance Ceiling on r5.4xlarge |
42 | | -- Realistic ceiling: 26-27/43 with noise-dependent Q03 |
43 | | -- Borderline queries (Q28, Q29, Q27, Q30, Q14) are 2.2-2.5x — need 10-20% improvement |
44 | | -- The 10-20% gap is fundamental Lucene DocValues overhead, not code inefficiency |
45 | | -- To reach ≥38/43: need m5.8xlarge (32 vCPU) or architectural changes (columnar storage, vectorized execution) |
| 17 | +Q29(2.16x) Q28(2.24x) Q14(2.28x) Q30(2.40x) Q27(2.73x) Q02(3.52x) Q35(4.00x) |
| 18 | +Q08(4.29x) Q04(5.04x) Q05(5.45x) Q09(5.53x) Q11(6.88x) Q16(7.08x) Q13(7.78x) |
| 19 | +Q18(9.78x) Q39(29.42x) Q15(32.89x) |
| 20 | + |
| 21 | +## What Was Done This Iteration |
| 22 | +1. Explored previously untried optimization paths: columnar cache extension, DirectReader bypass, batch DV reads |
| 23 | +2. Implemented columnar cache for scanSegmentForCountDistinct (Q08 path): loads both key columns into long[] arrays before iterating |
| 24 | +3. Tried extending columnar cache to executeMixedDedupWithHashSets (Q09 path) — REVERTED due to memory pressure regression |
| 25 | +4. Q03 improved from 2.12x to 1.57x, within 2x in this run; the result is noise-dependent (1.86-2.12x across iter17 runs) |
| 26 | +5. Q08 columnar cache shows marginal improvement in isolation (2.309s → 2.274s, ~1.5%) |
| 27 | + |
| 28 | +## Performance Ceiling Analysis |
| 29 | +- 18 iterations of optimization have exhausted code-level improvements on r5.4xlarge |
| 30 | +- Remaining gap is fundamental: Lucene DocValues per-doc decode (~3-5ns) vs ClickHouse columnar bulk reads (~0.5-1ns) |
| 31 | +- Borderline queries (Q29 2.16x, Q28 2.24x, Q14 2.28x, Q30 2.40x) need roughly 7-17% lower latency to reach 2x |
| 32 | +- That gap cannot be closed by further tuning on top of the DocValues API; it requires one of: |
| 33 | + 1. More CPU (m5.8xlarge with 32 vCPU) |
| 34 | + 2. Bypassing Lucene DocValues API (DirectReader/PackedInts access) |
| 35 | + 3. Custom columnar storage format |
46 | 36 |
|
47 | 37 | ## Next Steps |
48 | | -1. **Move to m5.8xlarge (32 vCPU)**: Doubling CPU count would halve per-shard execution time, potentially bringing borderline queries within 2x |
49 | | -2. **Columnar storage format**: Replace Lucene DocValues with a columnar format (Arrow, Parquet) for bulk vectorized reads |
50 | | -3. **Vectorized execution**: Use SIMD instructions for batch aggregation instead of per-doc scalar operations |
51 | | -4. **Q16/Q18 OOM mitigation**: These queries cause GC cascades that affect subsequent queries in benchmark runs |
| 38 | +1. **DirectReader bypass**: Access Lucene's internal packed integer data directly, bypassing SortedNumericDocValues API. Could reduce per-value cost from ~3ns to ~1ns. Requires reflection or codec fork. |
| 39 | +2. **Move to m5.8xlarge**: Doubling CPU count would halve per-shard execution time, potentially bringing borderline queries within 2x. |
| 40 | +3. **Q16 OOM mitigation**: Q16 causes GC cascades that affect Q15-Q27 in sequential benchmark runs. |
52 | 41 |
|
53 | 42 | ## Evidence |
54 | | -- Clean benchmark: /tmp/iter17_baseline2/r5.4xlarge.json (26/43 within 2x) |
55 | | -- Final benchmark: /tmp/iter17_final/r5.4xlarge.json (25/43, Q03 noise-dependent) |
56 | | -- Correctness: 39/43 PASS (/tmp/correctness_iter17b.log) |
57 | | -- Build: BUILD SUCCESSFUL (no code changes) |
| 43 | +- Full benchmark: /tmp/iter18_full/r5.4xlarge.json (26/43 within 2x) |
| 44 | +- Q08 isolated: /tmp/iter18_q08_isolated/r5.4xlarge.json (2.274s best) |
| 45 | +- Correctness: 39/43 PASS (/tmp/correctness_iter18c.log) |
| 46 | +- Build: BUILD SUCCESSFUL |
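
A minimal sketch of the columnar-cache idea from item 2 of "What Was Done This Iteration": both key columns are copied into `long[]` arrays first, and the COUNT(DISTINCT) pass then iterates over the arrays instead of decoding per document. This is an illustration under assumptions, not the actual `scanSegmentForCountDistinct` code. The class name, method names, and the `IntToLongFunction` standing in for a per-doc DocValues read are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.IntToLongFunction;

// Hypothetical illustration only; none of these names come from the codebase.
final class ColumnarCacheSketch {

    // Copy one column into a flat long[] so later passes read from memory
    // instead of calling a per-doc decoder for every value.
    // perDocReader stands in for a per-document DocValues lookup (assumption).
    static long[] loadColumn(int maxDoc, IntToLongFunction perDocReader) {
        long[] column = new long[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            column[doc] = perDocReader.applyAsLong(doc);
        }
        return column;
    }

    // COUNT(DISTINCT distinctKey) GROUP BY groupKey over two cached columns.
    static Map<Long, Integer> countDistinctPerGroup(long[] groupKeys, long[] distinctKeys) {
        Map<Long, Set<Long>> seen = new HashMap<>();
        for (int doc = 0; doc < groupKeys.length; doc++) {
            seen.computeIfAbsent(groupKeys[doc], k -> new HashSet<>()).add(distinctKeys[doc]);
        }
        Map<Long, Integer> counts = new HashMap<>();
        seen.forEach((group, values) -> counts.put(group, values.size()));
        return counts;
    }

    public static void main(String[] args) {
        long[] rawGroups = {1, 1, 2, 2, 2};
        long[] rawValues = {10, 10, 20, 21, 20};
        // Load both key columns before iterating, as the status note describes.
        long[] groups = loadColumn(rawGroups.length, doc -> rawGroups[doc]);
        long[] values = loadColumn(rawValues.length, doc -> rawValues[doc]);
        System.out.println(countDistinctPerGroup(groups, values));
    }
}
```

The cache costs 16 bytes per document for two `long` columns, which is consistent with the memory-pressure regression reported when the same approach was extended to the Q09 path.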