| 1 | +# Task: Execute Phase D Handover — DQE Optimization 25/43 → 43/43 |
| 2 | + |
| 3 | +Execute the plan in `docs/handover/2026-03-24-phase-d-handover.md` to optimize the DQE (Direct Query Engine) for the ClickBench benchmark. The goal is to bring all 43 queries within 2x of ClickHouse-Parquet performance. |
| 4 | + |
| 5 | +## Current State |
| 6 | +- **Score:** 25/43 queries within 2x of ClickHouse-Parquet |
| 7 | +- **Branch:** `wukong` |
| 8 | +- **Correctness:** 33/43 pass on 1M dataset |
| 9 | +- **Hardware:** OpenSearch on m5.8xlarge (32 vCPU, 128GB RAM), 4 shards, ~100M docs |
| 10 | + |
| 11 | +## Success Criteria |
| 12 | +1. **Primary:** ≥38/43 queries within 2x of CH-Parquet (stretch: 43/43) |
| 13 | +2. **No regressions:** Correctness must stay ≥33/43 |
| 14 | +3. **No regressions:** Queries already within 2x must stay within 2x |
| 15 | +4. **Evidence:** Full benchmark run with comparison output after each optimization |
| 16 | + |
| 17 | +## Priority Order (from handover) |
| 18 | + |
| 19 | +### Step 1: COUNT(DISTINCT) Fusion (Q04, Q05, Q08, Q09, Q11, Q13 — 6 queries) |
| 20 | +Intercept the two-level Calcite plan at the `TransportShardExecuteAction` dispatch level. Detect the pattern of an outer Aggregate(GROUP BY x, COUNT(*)) over an inner Aggregate(GROUP BY x, y, COUNT(*)), and route it to a fused GROUP BY with a per-group `LongOpenHashSet` accumulator. Key file: `TransportShardExecuteAction.java:280-360`. |
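
The fused accumulator described in Step 1 can be pictured with the minimal sketch below. It is not the DQE code: the class and method names are hypothetical, the columns are plain `long[]` arrays, and `java.util.HashSet<Long>` stands in for `LongOpenHashSet` so the example has no external dependencies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: single-pass GROUP BY x with COUNT(DISTINCT y).
final class CountDistinctFusionSketch {

    // Returns, for each group key x, the number of distinct y values seen with it.
    public static Map<Long, Integer> countDistinctPerGroup(long[] groupKeys, long[] values) {
        if (groupKeys.length != values.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        // One set per group replaces the inner GROUP BY (x, y) stage.
        // HashSet<Long> is a stand-in for a primitive long set (assumption).
        Map<Long, Set<Long>> seen = new HashMap<>();
        for (int i = 0; i < groupKeys.length; i++) {
            seen.computeIfAbsent(groupKeys[i], k -> new HashSet<>()).add(values[i]);
        }
        // The outer COUNT(*) becomes the size of each per-group set.
        Map<Long, Integer> result = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> e : seen.entrySet()) {
            result.put(e.getKey(), e.getValue().size());
        }
        return result;
    }

    public static void main(String[] args) {
        long[] x = {1, 1, 1, 2, 2};
        long[] y = {10, 10, 11, 10, 10};
        System.out.println(countDistinctPerGroup(x, y));
    }
}
```

When the data is spread over several shards, the per-group sets have to be unioned before taking sizes. Summing per-shard distinct counts would overcount any value that appears on more than one shard.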
| 21 | + |
| 22 | +### Step 2: Parallelize executeSingleKeyNumericFlat (Q15 + similar) |
| 23 | +Q15 scans ~100M rows sequentially. Split the scan across parallel workers, as `executeWithEval` already does. |
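
As a rough illustration of Step 2, the sketch below splits a single-key numeric scan into contiguous row ranges, aggregates each range in its own worker, and merges the partial maps. It does not reproduce `executeWithEval` or `executeSingleKeyNumericFlat`. The class name, the `long[]` column, and the fixed thread pool are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: parallel COUNT(*) GROUP BY key over a flat numeric column.
final class ParallelFlatScanSketch {

    public static Map<Long, Long> countByKey(long[] keys, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (keys.length + workers - 1) / workers; // rows per worker, rounded up
            List<Future<Map<Long, Long>>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = (int) Math.min((long) keys.length, (long) w * chunk);
                final int to = (int) Math.min((long) keys.length, (long) from + chunk);
                // Each worker owns a private map, so the scan needs no locking.
                partials.add(pool.submit(() -> {
                    Map<Long, Long> local = new HashMap<>();
                    for (int i = from; i < to; i++) {
                        local.merge(keys[i], 1L, Long::sum);
                    }
                    return local;
                }));
            }
            // Merge the partial results on the calling thread.
            Map<Long, Long> merged = new HashMap<>();
            for (Future<Map<Long, Long>> f : partials) {
                f.get().forEach((k, v) -> merged.merge(k, v, Long::sum));
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countByKey(new long[]{7, 7, 8, 9, 7, 8}, 4));
    }
}
```

The merge step is serial, so this layout pays off mainly when the number of distinct keys is small relative to the number of rows scanned.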
| 24 | + |
| 25 | +### Step 3: Hash-Partitioned Aggregation (Q16, Q18, Q32) |
| 26 | +Partition the group-key space into buckets, process one bucket at a time, then merge. This pattern is already proven on Q33/Q34. |
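
A minimal sketch of the bucket-at-a-time idea in Step 3 follows. It assumes a non-null `String[]` key column and a hypothetical class name, and it is not taken from the Q33/Q34 implementation. Each pass aggregates only the keys whose hash falls into the current bucket, so the live hash table holds roughly 1/`partitions` of the distinct keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: hash-partitioned COUNT(*) GROUP BY key.
final class HashPartitionedAggSketch {

    public static Map<String, Long> countByKey(String[] keys, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        Map<String, Long> result = new HashMap<>();
        for (int p = 0; p < partitions; p++) {
            // Only keys belonging to bucket p are aggregated in this pass.
            Map<String, Long> bucket = new HashMap<>();
            for (String k : keys) {
                if (Math.floorMod(k.hashCode(), partitions) == p) {
                    bucket.merge(k, 1L, Long::sum);
                }
            }
            // Buckets are disjoint by construction, so merging is a plain copy.
            result.putAll(bucket);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] urls = {"a", "b", "a", "c", "b", "a"};
        System.out.println(countByKey(urls, 4));
    }
}
```

The sketch copies every bucket into one result map for simplicity. In a query with `ORDER BY ... LIMIT`, each bucket could instead be reduced to its top rows before merging, which is where the memory saving would come from. The trade-off is one pass over the input per bucket.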
| 27 | + |
| 28 | +### Step 4: Borderline Queries (Q02, Q30, Q31, Q37) |
| 29 | +Small, targeted optimizations. Q31 needs only a 3ms improvement. |
| 30 | + |
| 31 | +### Step 5: Q28 REGEXP_REPLACE |
| 32 | +Cache compiled `Pattern` objects. Hoist the regex computation before the aggregation loop. |
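
The two ideas in Step 5 might look like the sketch below. The class, the cache, and the column representation are hypothetical, and the regex in `main` is only an illustrative URL pattern, not necessarily the one Q28 uses. The pattern is compiled once per distinct regex string and looked up once per column, never per row.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch: cached Pattern plus regex work hoisted out of the row loop.
final class RegexCacheSketch {

    private static final ConcurrentHashMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles each distinct regex string at most once.
    public static Pattern compiled(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // Applies the replacement to a whole column before any aggregation runs.
    public static String[] replaceAll(String[] column, String regex, String replacement) {
        Pattern pattern = compiled(regex); // looked up once, outside the loop
        String[] out = new String[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = pattern.matcher(column[i]).replaceAll(replacement);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] referers = {"https://www.example.com/path", "http://example.org/x/y"};
        String regex = "^https?://(?:www\\.)?([^/]+)/.*$"; // illustrative pattern (assumption)
        System.out.println(Arrays.toString(replaceAll(referers, regex, "$1")));
    }
}
```

Note that Java's `Matcher.replaceAll` uses `$1` for group references, whereas SQL `REGEXP_REPLACE` dialects often use `\1`, so a translation step may be needed.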
| 33 | + |
| 34 | +### Step 6: Full-Table High-Cardinality VARCHAR (Q35, Q36, Q39) |
| 35 | +Hash-partitioned aggregation + parallel segment scanning. |
| 36 | + |
| 37 | +## Key Architecture |
| 38 | +Read the full handover doc at `docs/handover/2026-03-24-phase-d-handover.md` for: |
| 39 | +- Complete query status table with ratios |
| 40 | +- Code map and key source files |
| 41 | +- Known issues and pitfalls (query numbering, JIT warmup, plugin reload) |
| 42 | +- Build/test/benchmark commands |
| 43 | + |
| 44 | +## Build & Test Commands |
| 45 | +```bash |
| 46 | +# Compile DQE only (~5s) |
| 47 | +cd /home/ec2-user/oss/wukong && ./gradlew :dqe:compileJava |
| 48 | + |
| 49 | +# Full rebuild + restart + reinstall (~3 min) |
| 50 | +cd /home/ec2-user/oss/wukong/benchmarks/clickbench && bash run/run_all.sh reload-plugin |
| 51 | + |
| 52 | +# Correctness (1M dataset) |
| 53 | +bash run/run_all.sh correctness |
| 54 | + |
| 55 | +# Single query benchmark |
| 56 | +bash run/run_opensearch.sh --warmup 3 --num-tries 5 --query N --output-dir /tmp/qN_test |
| 57 | + |
| 58 | +# Full benchmark |
| 59 | +bash run/run_opensearch.sh --warmup 3 --num-tries 5 --output-dir /tmp/full_bench |
| 60 | +``` |
| 61 | + |
| 62 | +## CRITICAL WARNINGS |
| 63 | +- Query numbering: run script is 1-based, JSON results are 0-based |
| 64 | +- Always benchmark on full 100M `hits` index, NOT 1M `hits_1m` |
| 65 | +- Always use `--warmup 3` for JIT compilation |
| 66 | +- Compare against CH-Parquet official baseline, NOT native MergeTree |
| 67 | +- Baseline file: `benchmarks/clickbench/results/performance/clickhouse_parquet_official/c6a.4xlarge.json` |
| 68 | +- Never run benchmarks and `reload-plugin` concurrently |
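
To make the numbering warning concrete, here is a small sketch of a "within 2x" check. It assumes the per-query timings have already been loaded into two arrays indexed the way the JSON results are (0-based), and that "within 2x" means a plain ratio of DQE time to baseline time. Both are assumptions, and the class is hypothetical. Failing queries are reported with the 1-based numbers the run script expects for `--query N`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: find queries slower than 2x the baseline.
final class WithinTwoXSketch {

    // Inputs are indexed 0-based (JSON order); the result uses 1-based run-script numbers.
    public static List<Integer> failingRunScriptNumbers(double[] dqeSeconds, double[] baselineSeconds) {
        if (dqeSeconds.length != baselineSeconds.length) {
            throw new IllegalArgumentException("result arrays must have the same length");
        }
        List<Integer> failing = new ArrayList<>();
        for (int i = 0; i < dqeSeconds.length; i++) {
            if (dqeSeconds[i] > 2.0 * baselineSeconds[i]) {
                failing.add(i + 1); // convert 0-based JSON index to 1-based query number
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        double[] dqe = {0.10, 0.50, 0.21};
        double[] baseline = {0.10, 0.20, 0.10};
        System.out.println(failingRunScriptNumbers(dqe, baseline));
    }
}
```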
| 69 | + |
| 70 | +## Approach |
| 71 | +Work one step at a time. After each optimization: |
| 72 | +1. Compile and verify no build errors |
| 73 | +2. Run correctness tests — must not regress below 33/43 |
| 74 | +3. Benchmark affected queries |
| 75 | +4. Run full benchmark to check for regressions |
| 76 | +5. Git commit with descriptive message |
| 77 | + |
| 78 | +--- |
| 79 | + |
| 80 | +## Instructions |
| 81 | + |
| 82 | +You are executing one iteration of a ralph loop. |
| 83 | + |
| 84 | +1. Read `.ralph/STATUS.md` for current state (if it exists) |
| 85 | +2. Read the tail of `.ralph/LOGS.md` for recent history (if it exists) |
| 86 | +3. Do the work described above |
| 87 | +4. When done with this iteration, update both files: |
| 88 | + |
| 89 | +### STATUS.md (overwrite entirely) |
| 90 | +Write current state with this structure: |
| 91 | +``` |
| 92 | +status: WORKING | COMPLETE |
| 93 | +iteration: N |
| 94 | + |
| 95 | +## Current State |
| 96 | +[What's the situation right now] |
| 97 | + |
| 98 | +## Next Steps |
| 99 | +[What needs to happen next — omit if COMPLETE] |
| 100 | + |
| 101 | +## Evidence |
| 102 | +[Test results, benchmark numbers, build output — whatever proves progress] |
| 103 | +``` |
| 104 | + |
| 105 | +Set `status: COMPLETE` only when ALL success criteria from the task are met with evidence. |
| 106 | + |
| 107 | +### LOGS.md (append a section) |
| 108 | +Append to the end: |
| 109 | +``` |
| 110 | +## Iteration N — [date/time] |
| 111 | + |
| 112 | +### What I Did |
| 113 | +[Actions taken] |
| 114 | + |
| 115 | +### Results |
| 116 | +[Outcomes, test results, errors] |
| 117 | + |
| 118 | +### Decisions |
| 119 | +[Any architectural or approach decisions made and why] |
| 120 | +``` |
| 121 | + |
| 122 | +5. Git commit with a descriptive message and push |