Summary
When a fragment is deleted by retention while being read, `try_read/3` sets `current_not_found = true` and `maybe_start_request` calls `cancel_requests/1` to cancel all in-flight S3 requests. `gun:cancel` is called for each request, but gun still delivers any data that was already buffered or in flight on the TCP connection. These arrive as `gun_data` messages whose `StreamRef` is no longer in the `requests` map, so `match_async` returns `error` and each message is logged as an unexpected message and discarded.
Under high load, this produces a flood of hundreds of stale `gun_data` messages in the remote reader's mailbox. The remote reader processes them one at a time in `handle_info`, and while doing so it cannot process the gun responses for the new request it issued after the jump. If the backlog is large enough, the new request's response is delayed past the 30-second request timeout, which cancels and re-issues the request, potentially triggering another flood. Eventually `rabbit_stream_reader` times out on its 60-second `gen_server:call` to the remote reader and crashes the consumer connection.
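For context, gun's own API offers a way to address exactly this gap: `gun:cancel/2` only stops future delivery for a stream, while `gun:flush/1` drops messages for a connection or stream that are already sitting in the caller's mailbox. A minimal sketch of flushing on cancel, assuming a `#state{}` record with `conn` and `requests` fields (the record shape and the body of `cancel_requests/1` are assumptions based on this report, not the actual module code):

```erlang
%% Sketch only: field names and cancel_requests/1 body are assumed.
%% gun:cancel/2 stops future delivery for the stream; gun:flush/1 then
%% removes already-delivered gun_* messages for that stream from the
%% calling process's mailbox, so they never reach handle_info.
cancel_requests(#state{conn = ConnPid, requests = Requests} = State) ->
    maps:foreach(
      fun(StreamRef, _Request) ->
              ok = gun:cancel(ConnPid, StreamRef),
              %% drop buffered gun_data/gun_response for this stream
              ok = gun:flush(StreamRef)
      end,
      Requests),
    State#state{requests = #{}}.
```

If this works as documented, it would prevent the stale-message flood at the source rather than absorbing it in `handle_info`.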
Evidence
From a high-throughput test run (6 streams, 12 producers, 12 consumers):
```
15:51:43 [warning] Fragment 4810243 was deleted by retention while being read.
                   Jumping to oldest available fragment 5533579
15:51:43 - 15:52:17  235 "received unexpected message" log entries
15:52:17 [warning] S3 request timed out on <conn>/<stream_ref>   ← 30s timeout fires
15:52:43 [debug] rabbit_stream_reader terminating ... {timeout, {gen_server, call, [..., 60000]}}
```
The 30-second request timeout (added in PR #121) fires correctly, but the mailbox backlog prevents the re-issued request from being processed in time.
Instance type
m7g.large
Fix direction
The `error` branch in `handle_info` processes unexpected messages one at a time. Options:
- Drain all pending unexpected messages in a single `handle_info` pass before returning, rather than processing them one at a time.
- Suppress the debug log for stale `gun_data` messages to reduce overhead (the log itself adds latency under flood conditions).
- Track cancelled `StreamRef`s and explicitly ignore their responses without going through `match_async`.
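The first option can be sketched as a selective receive with `after 0`: on the first `gun_data` message, keep pulling queued `gun_data` messages until the mailbox has none left, dropping stale ones and handling live ones inline, all in one `handle_info` pass. This is illustrative only; `handle_data/3` is a hypothetical helper, and `match_async` is assumed to have the `{ok, Request} | error` shape described above:

```erlang
%% Sketch of draining the backlog in one handle_info pass.
%% match_async/2 is assumed to return {ok, Request} | error;
%% handle_data/3 is a hypothetical helper for live responses.
handle_info({gun_data, _ConnPid, _StreamRef, _Fin, _Data} = Msg, State) ->
    {noreply, drain_gun_data(Msg, State)}.

drain_gun_data({gun_data, _Conn, StreamRef, _Fin, _Data} = Msg, State0) ->
    State1 = case match_async(StreamRef, State0) of
                 {ok, Request} -> handle_data(Msg, Request, State0);
                 error         -> State0  %% stale: drop, no per-message log
             end,
    receive
        {gun_data, _, _, _, _} = Next ->
            drain_gun_data(Next, State1)
    after 0 ->
        State1
    end.
```

One caveat with this shape: the selective receive pulls `gun_data` ahead of any other messages in the mailbox, so it is only safe if ordering relative to non-`gun_data` messages does not matter; tracking cancelled `StreamRef`s (the third option) avoids that reordering.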