test(red-197205): regression test for zero-rate stats with --reconnect-interval by fcostaoliveira · Pull Request #384 · redis/memtier_benchmark

fcostaoliveira · 2026-05-17T15:35:45Z

Summary

Pins down the RED-197205 invariant: running with `--reconnect-interval=1` must not zero out `Ops/sec`, `KB/sec`, `Hits/sec` or `Misses/sec` in the final JSON summary.
The bug itself was already fixed on master by 0228bac (PR Add client staircase ramp-up for incremental connection growth #350 — Add client staircase ramp-up) as a side-effect of the `m_started` atomic flag added to skip never-started clients in `merge_run_stats()` / `aggregate_inst_histogram()`. RED-197205 was never called out by name in that PR, so this test makes the invariant explicit.

Why this matters

Before PR #350, `client_group::merge_run_stats()` iterated all clients including those that were `prepare()`-d but never reached `set_start_time()`. Their zeroed `m_start_time` was factorial-averaged into the aggregate, pulling it toward the epoch. With `m_end_time` correct but `m_start_time` near zero, `test_duration_usec` became huge and every `ops / test_duration_usec * 1000000` division collapsed to ~0. The latency fields were correct because they're computed independently of `test_duration_usec` (`accumulated_latency / count`), which is exactly the asymmetry the bug report observed.

Verification

Locally reproduced + confirmed the fix using the test's exact workload (`-t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1`):

Binary	Totals.Ops/sec	Totals.KB/sec	Totals.Misses/sec	Totals.Latency
v2.2.1	0.00	0.00	0.00	0.249
master	52320	3035	26114	0.249

The test would have caught the bug on v2.2.1; master passes cleanly.

Test-shape notes

The bug only fires with `--test-time` (not `-n`). With `-n`, every client runs to completion and reaches `set_start_time`, so the zero-state merge path is never hit. `_build_benchmark` gains an optional `test_time` parameter that's mutually exclusive with `requests` (memtier rejects both).
Test runtime: ~5 s + harness overhead.

Test plan

Test passes on master.
Test would have failed on v2.2.1 (`Ops/sec=0.00`).
`make format-check-staged` (pre-commit hook) passes.
Full RLTest matrix on CI.

🤖 Generated with Claude Code

Note

Low Risk
Test-only changes to RLTest JSON integrity checks; no production memtier or stats-merge code is modified.

Overview
Adds RED-197205 coverage so JSON summaries with --reconnect-interval=1 and --test-time keep non-zero Ops/sec, KB/sec, and Misses/sec in ALL STATS (and per-command rates when counts are positive). The workload matches the original bug (-t 4 -c 4, ratio 1:1); cluster runs are skipped because memtier disallows reconnect-interval in cluster mode.

_build_benchmark now accepts optional test_time, which clears requests so memtier is not given conflicting limits.

Time-series latency checks add 1.0 / count to the tolerance so 1 ms rounding on Accumulated Latency does not flake on low-count buckets (e.g. the last second of a short timed run with reconnect).

^{Reviewed by Cursor Bugbot for commit a3c4b4d. Bugbot is set up for automated code reviews on this repo. Configure here.}

…ncy quantization CI on PR redis#384 surfaced this: with --reconnect-interval=1 and --test-time=5, the trailing per-second bucket can hold very few ops (Count as low as 1). 'Accumulated Latency' is emitted with %lld (1 ms precision), so the reported value can differ from the true sum by up to 1 ms — which means the derived avg = accumulated/count can differ from the double-precision 'Average Latency' by up to (1 / count) ms. The previous tolerance was max(0.01 ms, 2% relative). For Count=1 with a sub-ms avg that's 0.01 ms; observed gap was 0.058 ms (0.391 vs 0.333), well within the quantization bound but outside the static tolerance. Add a third term `1.0 / count` so the quantization error is modeled explicitly. For large-count buckets (the typical case in every existing test) the existing tolerance still dominates, so no behavior change for them. Verified by replaying the exact failing workload locally: -t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1 old tolerance: 3 violations across Sets/Gets/Totals bucket 5 new tolerance: 0 violations Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t-interval RED-197205 reported that running with --reconnect-interval=1 produced a final summary where Totals.Ops/sec, KB/sec, Hits/sec and Misses/sec all rendered as 0.00 even though Count and Latency were correct. Root cause (on v2.2.1): client_group::merge_run_stats() iterated *all* clients including those that were prepare()-d but never reached set_start_time(). Their zeroed m_start_time was factorial-averaged into the aggregate, dragging it toward the epoch. With m_end_time correct but m_start_time near zero, test_duration_usec became huge and every `ops / test_duration_usec * 1000000` division collapsed to ~0. Fix landed in 0228bac (PR redis#350): a run_stats::m_started atomic flag plus has_started() guards in merge_run_stats() and aggregate_inst_histogram() that skip never-started clients. RED-197205 was a side-effect fix, not the headline of that PR, so this test pins down the invariant explicitly. Verified end-to-end against the test's exact workload: v2.2.1 + 4t/4c --test-time=5 --ratio=1:1 --reconnect-interval=1 -> Ops/sec=0.00, KB/sec=0.00, Misses/sec=0.00 (bug fires) master + same args -> Ops/sec=52320, KB/sec=3035, Misses/sec=26114 (fixed) The bug only fires reliably with --test-time (not -n); with -n every client runs to completion and reaches set_start_time, so the zero-state merge path is never hit. _build_benchmark gains an optional test_time parameter so the helper can express that distinction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ncy quantization CI on PR redis#384 surfaced this: with --reconnect-interval=1 and --test-time=5, the trailing per-second bucket can hold very few ops (Count as low as 1). 'Accumulated Latency' is emitted with %lld (1 ms precision), so the reported value can differ from the true sum by up to 1 ms — which means the derived avg = accumulated/count can differ from the double-precision 'Average Latency' by up to (1 / count) ms. The previous tolerance was max(0.01 ms, 2% relative). For Count=1 with a sub-ms avg that's 0.01 ms; observed gap was 0.058 ms (0.391 vs 0.333), well within the quantization bound but outside the static tolerance. Add a third term `1.0 / count` so the quantization error is modeled explicitly. For large-count buckets (the typical case in every existing test) the existing tolerance still dominates, so no behavior change for them. Verified by replaying the exact failing workload locally: -t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1 old tolerance: 3 violations across Sets/Gets/Totals bucket 5 new tolerance: 0 violations Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

memtier rejects --reconnect-interval together with --cluster-mode ("error: cluster mode dose not support reconnect-interval option"), so the test exited non-zero on every cluster CI job. The RED-197205 regression is not cluster-specific — guard the test with env.isCluster() so it only runs in single-endpoint configurations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fcostaoliveira · 2026-05-27T23:12:14Z

CI is fully green (65/65 checks, commit a3c4b4d). Ready for approval @paulorsousa

fcostaoliveira requested a review from paulorsousa May 17, 2026 15:35

paulorsousa previously approved these changes May 17, 2026

View reviewed changes

fcostaoliveira dismissed paulorsousa’s stale review via dd2f718 May 18, 2026 08:45

fcostaoliveira requested review from paulorsousa May 18, 2026 09:37

fcostaoliveira and others added 3 commits May 18, 2026 12:58

fcostaoliveira force-pushed the test/red-197205-reconnect-interval-rates branch from 930f554 to b90142c Compare May 18, 2026 12:00

ci: retrigger CI run after timeout fluke

a3c4b4d

fcostaoliveira merged commit 324553f into redis:master May 27, 2026
65 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(red-197205): regression test for zero-rate stats with --reconnect-interval#384

test(red-197205): regression test for zero-rate stats with --reconnect-interval#384
fcostaoliveira merged 4 commits into
redis:masterfrom
filipecosta90:test/red-197205-reconnect-interval-rates

fcostaoliveira commented May 17, 2026 •

edited by cursor Bot

Loading

Uh oh!

fcostaoliveira commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

fcostaoliveira commented May 17, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this matters

Verification

Test-shape notes

Test plan

Uh oh!

fcostaoliveira commented May 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fcostaoliveira commented May 17, 2026 •

edited by cursor Bot

Loading