Description
Observed behavior
In version 2.10.20, it looks as if a handful of process crashes can cause NATS JetStream to forget that a stream ever existed. I've reproduced this with both three- and five-node clusters, with replication factors of 3 and 5. This occurs with sync_interval=always as well as with the default two-minute sync interval.
This test creates a single JetStream stream called jepsen-stream and publishes a series of unique values to a single subject (jepsen.0) within it.
After killing a few nats-server processes with kill -9, attempts to publish messages throw "503 No Responders Available For Request", and attempts to subscribe to the subject throw "Can't subscribe, [SUB-90007] No matching streams for subject". This persists even when we restart every node and stop killing them. Calling JetStreamManager.getStreamNames() returns an empty list, rather than ["jepsen-stream"]. This state of affairs seems to last indefinitely: in one test we waited 10,000 seconds for recovery, and the stream never came back.
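The recovery check described above amounts to polling the list of stream names until jepsen-stream reappears or a deadline passes. Here is a minimal Java sketch of that idea. It is not the actual test code (which is Clojure, linked below). The names StreamRecoveryCheck and waitForStream and their parameters are hypothetical, and the stream listing is passed in as a Supplier so the sketch runs without a NATS server.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only; not part of jnats or the Jepsen test.
class StreamRecoveryCheck {
    /**
     * Polls listStreams until it reports streamName or the timeout elapses.
     * Returns true if the stream was seen, false if the deadline passed first.
     */
    static boolean waitForStream(Supplier<List<String>> listStreams,
                                 String streamName,
                                 Duration timeout,
                                 Duration pollInterval) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (listStreams.get().contains(streamName)) {
                return true;
            }
            // Subtraction keeps the comparison safe if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for a cluster that never recovers: the listing is always empty.
        Supplier<List<String>> lost = () -> List.of();
        boolean recovered = waitForStream(lost, "jepsen-stream",
                Duration.ofMillis(200), Duration.ofMillis(50));
        System.out.println("recovered = " + recovered);
    }
}
```

Against a real cluster, the supplier would presumably wrap the getStreamNames() call mentioned above and convert its checked exceptions into unchecked ones. In the failure reported here, such a check would be expected to keep returning false however long the timeout.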
Node logs are attached as 20250509T191519.377-0500.zip; nothing obvious is jumping out at me.
I wanted to check--is this... expected behavior? Am I perhaps holding NATS wrong somehow? You can find the NATS Java code I'm calling here: https://github.com/jepsen-io/nats/blob/9e52d9cf0c5f94d436efbfef9e2f2e1288ad7b0f/src/jepsen/nats/client.clj#L78-L136.
Expected behavior
JetStream streams should not vanish permanently? The point of JetStream is that they're supposed to be persistent, right?
Server and client version
Server: 2.10.20
Client: io.nats/jnats "2.21.1"
Host environment
Right now these nodes run Debian 12, in 3- or 5-node clusters under LXC.
Steps to reproduce
You can reproduce this by cloning the test suite linked above, at commit 9e52d9, setting up a Jepsen environment, and running lein run test --rate 100 --time-limit 300 --nemesis kill --test-count 10 --sync-interval always. The problem usually manifests after just a few minutes.
