Skip to content

test(red-197205): regression test for zero-rate stats with --reconnect-interval#384

Merged
fcostaoliveira merged 4 commits into
redis:masterfrom
filipecosta90:test/red-197205-reconnect-interval-rates
May 27, 2026
Merged

test(red-197205): regression test for zero-rate stats with --reconnect-interval#384
fcostaoliveira merged 4 commits into
redis:masterfrom
filipecosta90:test/red-197205-reconnect-interval-rates

Conversation

@fcostaoliveira
Copy link
Copy Markdown
Collaborator

@fcostaoliveira fcostaoliveira commented May 17, 2026

Summary

  • Pins down the RED-197205 invariant: running with `--reconnect-interval=1` must not zero out `Ops/sec`, `KB/sec`, `Hits/sec` or `Misses/sec` in the final JSON summary.
  • The bug itself was already fixed on master by 0228bac (PR Add client staircase ramp-up for incremental connection growth #350Add client staircase ramp-up) as a side-effect of the `m_started` atomic flag added to skip never-started clients in `merge_run_stats()` / `aggregate_inst_histogram()`. RED-197205 was never called out by name in that PR, so this test makes the invariant explicit.

Why this matters

Before PR #350, `client_group::merge_run_stats()` iterated all clients including those that were `prepare()`-d but never reached `set_start_time()`. Their zeroed `m_start_time` was factorial-averaged into the aggregate, pulling it toward the epoch. With `m_end_time` correct but `m_start_time` near zero, `test_duration_usec` became huge and every `ops / test_duration_usec * 1000000` division collapsed to ~0. The latency fields were correct because they're computed independently of `test_duration_usec` (`accumulated_latency / count`), which is exactly the asymmetry the bug report observed.

Verification

Locally reproduced + confirmed the fix using the test's exact workload (`-t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1`):

Binary Totals.Ops/sec Totals.KB/sec Totals.Misses/sec Totals.Latency
v2.2.1 0.00 0.00 0.00 0.249
master 52320 3035 26114 0.249

The test would have caught the bug on v2.2.1; master passes cleanly.

Test-shape notes

  • The bug only fires with `--test-time` (not `-n`). With `-n`, every client runs to completion and reaches `set_start_time`, so the zero-state merge path is never hit. `_build_benchmark` gains an optional `test_time` parameter that's mutually exclusive with `requests` (memtier rejects both).
  • Test runtime: ~5 s + harness overhead.

Test plan

  • Test passes on master.
  • Test would have failed on v2.2.1 (`Ops/sec=0.00`).
  • `make format-check-staged` (pre-commit hook) passes.
  • Full RLTest matrix on CI.

🤖 Generated with Claude Code


Note

Low Risk
Test-only changes to RLTest JSON integrity checks; no production memtier or stats-merge code is modified.

Overview
Adds RED-197205 coverage so JSON summaries with --reconnect-interval=1 and --test-time keep non-zero Ops/sec, KB/sec, and Misses/sec in ALL STATS (and per-command rates when counts are positive). The workload matches the original bug (-t 4 -c 4, ratio 1:1); cluster runs are skipped because memtier disallows reconnect-interval in cluster mode.

_build_benchmark now accepts optional test_time, which clears requests so memtier is not given conflicting limits.

Time-series latency checks add 1.0 / count to the tolerance so 1 ms rounding on Accumulated Latency does not flake on low-count buckets (e.g. the last second of a short timed run with reconnect).

Reviewed by Cursor Bugbot for commit a3c4b4d. Bugbot is set up for automated code reviews on this repo. Configure here.

paulorsousa
paulorsousa previously approved these changes May 17, 2026
fcostaoliveira added a commit to filipecosta90/memtier_benchmark that referenced this pull request May 18, 2026
…ncy quantization

CI on PR redis#384 surfaced this: with --reconnect-interval=1 and --test-time=5,
the trailing per-second bucket can hold very few ops (Count as low as 1).
'Accumulated Latency' is emitted with %lld (1 ms precision), so the
reported value can differ from the true sum by up to 1 ms — which means
the derived avg = accumulated/count can differ from the double-precision
'Average Latency' by up to (1 / count) ms.

The previous tolerance was max(0.01 ms, 2% relative). For Count=1 with a
sub-ms avg that's 0.01 ms; observed gap was 0.058 ms (0.391 vs 0.333),
well within the quantization bound but outside the static tolerance.

Add a third term `1.0 / count` so the quantization error is modeled
explicitly. For large-count buckets (the typical case in every existing
test) the existing tolerance still dominates, so no behavior change for
them.

Verified by replaying the exact failing workload locally:
  -t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1
  old tolerance: 3 violations across Sets/Gets/Totals bucket 5
  new tolerance: 0 violations

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fcostaoliveira and others added 3 commits May 18, 2026 12:58
…t-interval

RED-197205 reported that running with --reconnect-interval=1 produced a
final summary where Totals.Ops/sec, KB/sec, Hits/sec and Misses/sec all
rendered as 0.00 even though Count and Latency were correct.

Root cause (on v2.2.1): client_group::merge_run_stats() iterated *all*
clients including those that were prepare()-d but never reached
set_start_time(). Their zeroed m_start_time was factorial-averaged into
the aggregate, dragging it toward the epoch. With m_end_time correct
but m_start_time near zero, test_duration_usec became huge and every
`ops / test_duration_usec * 1000000` division collapsed to ~0.

Fix landed in 0228bac (PR redis#350): a run_stats::m_started atomic flag
plus has_started() guards in merge_run_stats() and
aggregate_inst_histogram() that skip never-started clients. RED-197205
was a side-effect fix, not the headline of that PR, so this test pins
down the invariant explicitly.

Verified end-to-end against the test's exact workload:

  v2.2.1 + 4t/4c --test-time=5 --ratio=1:1 --reconnect-interval=1
    -> Ops/sec=0.00, KB/sec=0.00, Misses/sec=0.00  (bug fires)

  master + same args
    -> Ops/sec=52320, KB/sec=3035, Misses/sec=26114  (fixed)

The bug only fires reliably with --test-time (not -n); with -n every
client runs to completion and reaches set_start_time, so the zero-state
merge path is never hit. _build_benchmark gains an optional test_time
parameter so the helper can express that distinction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ncy quantization

CI on PR redis#384 surfaced this: with --reconnect-interval=1 and --test-time=5,
the trailing per-second bucket can hold very few ops (Count as low as 1).
'Accumulated Latency' is emitted with %lld (1 ms precision), so the
reported value can differ from the true sum by up to 1 ms — which means
the derived avg = accumulated/count can differ from the double-precision
'Average Latency' by up to (1 / count) ms.

The previous tolerance was max(0.01 ms, 2% relative). For Count=1 with a
sub-ms avg that's 0.01 ms; observed gap was 0.058 ms (0.391 vs 0.333),
well within the quantization bound but outside the static tolerance.

Add a third term `1.0 / count` so the quantization error is modeled
explicitly. For large-count buckets (the typical case in every existing
test) the existing tolerance still dominates, so no behavior change for
them.

Verified by replaying the exact failing workload locally:
  -t 4 -c 4 --test-time=5 --ratio=1:1 --reconnect-interval=1
  old tolerance: 3 violations across Sets/Gets/Totals bucket 5
  new tolerance: 0 violations

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
memtier rejects --reconnect-interval together with --cluster-mode
("error: cluster mode dose not support reconnect-interval option"),
so the test exited non-zero on every cluster CI job. The RED-197205
regression is not cluster-specific — guard the test with env.isCluster()
so it only runs in single-endpoint configurations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fcostaoliveira fcostaoliveira force-pushed the test/red-197205-reconnect-interval-rates branch from 930f554 to b90142c Compare May 18, 2026 12:00
@fcostaoliveira
Copy link
Copy Markdown
Collaborator Author

CI is fully green (65/65 checks, commit a3c4b4d). Ready for approval @paulorsousa

@fcostaoliveira fcostaoliveira merged commit 324553f into redis:master May 27, 2026
65 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants