|
| 1 | +# AGENTS.md — DQE Development & Benchmarking Guide |
| 2 | + |
| 3 | +## DQE Architecture (Quick Mental Model) |
| 4 | + |
| 5 | +``` |
| 6 | +SQL query → TransportTrinoSqlAction (coordinator) |
| 7 | + → PlanFragmenter (splits into shard plans) |
| 8 | + → TransportShardExecuteAction (per-shard dispatch) |
| 9 | + ├── FusedScanAggregate (scalar aggs: SUM, AVG, COUNT) |
| 10 | + ├── FusedGroupByAggregate (GROUP BY: varchar/numeric keys, flat/hash paths) |
| 11 | + └── Fast paths (COUNT DISTINCT: HashSet, bitset, ordinal) |
| 12 | + → Coordinator merges shard results → returns to client |
| 13 | +``` |
| 14 | + |
| 15 | +### Key Files |
| 16 | + |
| 17 | +| File | Lines | Purpose | |
| 18 | +|------|-------|---------| |
| 19 | +| `TransportShardExecuteAction.java` | ~2200 | Shard-level dispatch: routes queries to fast paths | |
| 20 | +| `FusedGroupByAggregate.java` | ~12700 | GROUP BY execution: varchar/numeric keys, flat arrays, collectors | |
| 21 | +| `FusedScanAggregate.java` | ~1800 | Scalar aggregation: SUM, AVG, COUNT, flat array path | |
| 22 | +| `TransportTrinoSqlAction.java` | ~4200 | Coordinator: plan optimization, shard fan-out, result merge | |
| 23 | + |
| 24 | +### Dispatch Priority (TransportShardExecuteAction.executePlan) |
| 25 | + |
| 26 | +1. Scalar agg → `FusedScanAggregate.canFuse()` → flat array path |
| 27 | +2. Bare single-column scan → COUNT(DISTINCT) fast paths |
| 28 | +3. 2-key COUNT(DISTINCT) → HashSet paths (numeric/varchar) |
| 29 | +4. Expression GROUP BY → ordinal-cached path |
| 30 | +5. Generic GROUP BY → `FusedGroupByAggregate.canFuse()` → fused path |
| 31 | +6. Fallback → generic pipeline |
| 32 | + |
| 33 | +### Filtered vs Unfiltered Queries |
| 34 | + |
| 35 | +- **MatchAllDocsQuery** (no WHERE): tight `for(doc=0; doc<maxDoc; doc++)` loop, sequential DV access |
| 36 | +- **Filtered** (WHERE clause): Collector-based `collect(int doc)` with virtual dispatch overhead |
| 37 | +- **Selective filter optimization**: bitset pre-collection for filters matching <50% of docs |
| 38 | + |
| 39 | +## Dev Iteration Loop |
| 40 | + |
| 41 | +1. Code change (edit Java files) |
| 42 | +2. Compile: `./gradlew :dqe:compileJava` |
| 43 | +3. Reload plugin — see Long-Running Task Rules |
| 44 | +4. Correctness gate — MUST be >= 38/43. If regression, STOP and fix. |
| 45 | +5. Benchmark target queries — see Long-Running Task Rules |
| 46 | + |
| 47 | +All steps 3-5 MUST follow the async execution pattern in Long-Running Task Rules. |
| 48 | + |
| 49 | +## Long-Running Task Rules |
| 50 | + |
| 51 | +Any command that may take longer than 2 minutes MUST be run asynchronously. Treat all of the following as long-running, even when a single run finishes faster: benchmarks (single-query and full suite), plugin reload, correctness tests, and compilation. |
| 52 | + |
| 53 | +### Async Execution Pattern |
| 54 | + |
| 55 | +1. **NEVER run long-running commands synchronously** — always background and poll. |
| 56 | +2. **Launch in a subshell** so the parent shell returns immediately: |
| 57 | + ```bash |
| 58 | + nohup bash -c 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash run/run_all.sh reload-plugin > /tmp/reload.log 2>&1' &>/dev/null & |
| 59 | + echo "launched" |
| 60 | + ``` |
| 61 | + **CRITICAL**: Plain `nohup cmd &` or `(cmd &)` does NOT work — the shell hangs waiting for the background process. You MUST use `nohup bash -c '...' &>/dev/null &`. |
| 62 | +3. **Poll for completion** — check output tail for success/failure: |
| 63 | + ```bash |
| 64 | + tail -5 /tmp/reload.log |
| 65 | + ``` |
| 66 | +4. **Poll interval**: every 10-30s for benchmarks, every 30-60s for builds. |
| 67 | +5. **Analyze each poll result** — if ERROR/FAILURE appears in output, stop and diagnose immediately. |
| 68 | +6. **Monitoring IS the task** — never launch a long-running command and then do something else. |
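
As a minimal sketch of the launch step, the pattern above can be wrapped in a small helper. `launch_bg` is a hypothetical name and is not part of the repository's scripts. It assumes bash and passes the log path and the command as positional arguments, so neither has to be spliced into the quoted script.

```bash
#!/usr/bin/env bash
# launch_bg LOG CMD  (hypothetical helper, not part of run_all.sh)
# Starts CMD detached with its stdout/stderr captured in LOG, then prints the
# PID of the detached wrapper so the caller can poll it with `kill -0`.
launch_bg() {
  local log="$1" cmd="$2"
  # Same shape as the pattern above: nohup bash -c '...' &>/dev/null &
  # Inside the quoted script, $1 is the log path and $2 is the command string.
  nohup bash -c 'bash -c "$2" > "$1" 2>&1' _ "$log" "$cmd" &>/dev/null &
  echo "$!"
}
```

A call might look like `PID=$(launch_bg /tmp/reload.log 'cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash run/run_all.sh reload-plugin')`, followed by polling `/tmp/reload.log` as described in steps 3-5.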
| 69 | + |
| 70 | +### Common Long-Running Commands |
| 71 | + |
| 72 | +| Command | Est. Time | Output File | Completion Marker | Error Marker | |
| 73 | +|---------|-----------|-------------|-------------------|--------------| |
| 74 | +| `./gradlew :dqe:compileJava` | ~5s | `/tmp/compile.log` | `BUILD SUCCESSFUL` | `BUILD FAILED` | |
| 75 | +| `run_all.sh reload-plugin` | 2-3 min | `/tmp/reload.log` | `reloaded successfully` | `FAILED` or `Error` | |
| 76 | +| `run_all.sh correctness` | ~2 min | `/tmp/correctness.log` | `Summary:` | `Error` | |
| 77 | +| `run_opensearch.sh --query N` | ~1 min | `/tmp/bench-qN.log` | `Results written` | `Error` or `failed` | |
| 78 | +| `run_opensearch.sh` (full suite) | 5-15 min | `/tmp/bench-full.log` | `Results written` | `Error` or `failed` | |
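
The polling rules and the marker columns in the table above can be combined into one loop. The following is a sketch under assumptions: `wait_for_marker` is a hypothetical helper (not an existing script), the markers are treated as extended regular expressions, and the error marker is checked first so that a failure is never masked by a later success line.

```bash
#!/usr/bin/env bash
# wait_for_marker LOG DONE_REGEX ERROR_REGEX [INTERVAL_SECONDS] [MAX_POLLS]
# (hypothetical helper) Polls LOG until one of the markers appears.
# Returns 0 on the completion marker, 1 on the error marker, 2 on timeout.
wait_for_marker() {
  local log="$1" done_re="$2" err_re="$3"
  local interval="${4:-15}" max_polls="${5:-120}"
  local i
  for ((i = 0; i < max_polls; i++)); do
    if [[ -f "$log" ]]; then
      # Error check comes first: stop and diagnose as soon as it shows up.
      if grep -Eq -- "$err_re" "$log"; then
        tail -n 5 "$log"
        return 1
      fi
      if grep -Eq -- "$done_re" "$log"; then
        tail -n 5 "$log"
        return 0
      fi
    fi
    sleep "$interval"
  done
  return 2
}
```

For a plugin reload, using the markers from the table, this could be called as `wait_for_marker /tmp/reload.log 'reloaded successfully' 'FAILED|Error' 30`.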
| 79 | + |
| 80 | +### Multi-Query Benchmark with Monitoring |
| 81 | + |
| 82 | +```bash |
| 83 | +# Benchmark multiple queries sequentially, monitoring each |
| 84 | +for Q in 31 32 38 41; do |
| 85 | + LOG=/tmp/bench-q${Q}.log |
| 86 | + nohup bash -c "cd /local/home/penghuo/oss/os-sql/benchmarks/clickbench && bash run/run_opensearch.sh --warmup 1 --num-tries 3 --query $Q --output-dir /tmp/q${Q} > $LOG 2>&1" &>/dev/null & |
| 87 | + PID=$! |
| 88 | + while kill -0 "$PID" 2>/dev/null; do sleep 10; tail -1 "$LOG" 2>/dev/null; done |
| 89 | + echo "=== Q${Q} ===" |
| 90 | + grep -E "Q[0-9]+ run" "$LOG" |
| 91 | +done |
| 92 | +``` |
| 93 | + |
| 94 | +### Kill All Benchmarks |
| 95 | + |
| 96 | +```bash |
| 97 | +pkill -f "run_opensearch.sh"; pkill -f "run_all.sh" |
| 98 | +``` |
| 99 | + |
| 100 | +## Query Numbering (CRITICAL) |
| 101 | + |
| 102 | +| Context | Indexing | To refer to Q17, use |
| 103 | +|---------|----------|-------------| |
| 104 | +| `--query N` in scripts | 1-based | `--query 18` for Q17 | |
| 105 | +| `queries_trino.sql` line | 1-based | line 18 for Q17 | |
| 106 | +| JSON `result[N]` | 0-based | `result[17]` for Q17 | |
| 107 | + |
| 108 | +**Mnemonic**: scripts and SQL are 1-based; JSON is 0-based and matches the Q label. |
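
To avoid off-by-one mistakes, the conversion can be done by a helper rather than by hand. This is a sketch only: `q_to_query_arg` and `q_to_json_index` are hypothetical names, and they assume the Q label is the 0-based one used in the table above (Q17 maps to `--query 18` and `result[17]`).

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the numbering table above.

# _q_number LABEL: strip one optional leading Q/q and validate the digits.
_q_number() {
  local label="${1#[Qq]}"
  if ! [[ "$label" =~ ^[0-9]+$ ]]; then
    echo "expected a label like Q17, got: $1" >&2
    return 1
  fi
  # 10# forces base 10 so labels such as 08 are not parsed as octal.
  echo $((10#$label))
}

# q_to_query_arg LABEL: 1-based value for `--query N` and the SQL file line.
q_to_query_arg() {
  local n
  n=$(_q_number "$1") || return 1
  echo $((n + 1))
}

# q_to_json_index LABEL: 0-based index for JSON `result[N]`.
q_to_json_index() {
  _q_number "$1"
}
```

For example, `bash run/run_opensearch.sh --query "$(q_to_query_arg Q17)"` passes `18` to the script.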
| 109 | + |
| 110 | +## Pitfalls |
| 111 | + |
| 112 | +- **NEVER** run `reload-plugin` while a benchmark is running |
| 113 | +- Benchmark on 100M (`hits`), correctness on 1M (`hits_1m`) |
| 114 | +- Use ClickHouse-Parquet baseline, NOT native MergeTree |
| 115 | +- Baseline file: `benchmarks/clickbench/results/performance/clickhouse_parquet_official/c6a.4xlarge.json` |
| 116 | +- OpenSearch endpoint: `http://localhost:9200`, DQE: `POST /_plugins/_trino_sql` |
| 117 | + |
| 118 | +## Current State (2026-03-26) |
| 119 | + |
| 120 | +- Correctness: 29/43 on 1M |
| 121 | +- Within 2x of CH-Parquet: 19/43 on r5.4xlarge (was 16/43 on m5.8xlarge before optimization) |
| 122 | +- Hybrid bitset/collector optimization deployed (selective filters use bitset, broad use Collector) |
| 123 | +- Bitset path: `Weight.count()` estimates selectivity; <50% of docs → bitset, else → Collector |
| 124 | +- Big wins: Q18 (0.01x), Q39 (1.1x), Q41 (0.28x), Q42 (1.02x), Q43 (0.70x) |
| 125 | +- Target: >= 32/43 within 2x |