JetStream stream lost when processes are killed and restarted [v2.10.20] #6888

@aphyr

Observed behavior

In version 2.10.20, it looks as if a handful of process crashes can cause NATS JetStream to forget that a stream ever existed. I've reproduced this with both three- and five-node clusters, at replication factors 3 and 5. It occurs with sync_interval=always as well as with the default two-minute sync interval.

This test creates a single JetStream stream called jepsen-stream and publishes a series of unique values to a single subject (jepsen.0) within it.

After killing a few nats-server processes with kill -9, attempts to publish messages throw 503 No Responders Available For Request, and attempts to subscribe to the subject throw Can't subscribe, [SUB-90007] No matching streams for subject. This persists even after we stop killing nodes and restart every one of them. Calling JetStreamManager.getStreamNames() returns an empty list rather than ["jepsen-stream"]. This state of affairs seems to last indefinitely--here's a test where we waited 10,000 seconds for recovery, and the stream never came back.
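For anyone who wants to check for the same symptom outside Jepsen, here is a minimal sketch of a recovery check: it polls a stream-listing call until the expected stream shows up or a deadline passes. StreamRecoveryCheck and waitForStream are hypothetical names made up for illustration; they are not part of jnats or of the Jepsen test suite. The listing call is passed in as a Callable so the helper itself depends only on the JDK.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of jnats or the Jepsen suite.
final class StreamRecoveryCheck {
    private StreamRecoveryCheck() {}

    /**
     * Polls listStreams until it reports expectedStream or the timeout elapses.
     * Returns true if the stream was seen, false on timeout or interruption.
     */
    static boolean waitForStream(Callable<List<String>> listStreams,
                                 String expectedStream,
                                 Duration timeout,
                                 Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                if (listStreams.call().contains(expectedStream)) {
                    return true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            } catch (Exception e) {
                // A failed request (for example while nodes are still
                // restarting) is treated as "not recovered yet"; keep polling.
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With a jnats management handle (assumed here to be a variable jsm obtained from the connection), the call would look something like StreamRecoveryCheck.waitForStream(jsm::getStreamNames, "jepsen-stream", Duration.ofSeconds(10000), Duration.ofSeconds(5)). In the failure described above, a check like this would be expected to return false.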

You'll find node logs here: 20250509T191519.377-0500.zip. Nothing obvious is jumping out at me.

I wanted to check--is this... expected behavior? Am I perhaps holding NATS wrong somehow? You can find the NATS Java code I'm calling here: https://github.com/jepsen-io/nats/blob/9e52d9cf0c5f94d436efbfef9e2f2e1288ad7b0f/src/jepsen/nats/client.clj#L78-L136.

Expected behavior

JetStream streams should not vanish permanently? The point of JetStream is that they're supposed to be persistent, right?

Server and client version

Server: 2.10.20
Client: io.nats/jnats "2.21.1"

Host environment

Right now these nodes run Debian 12, in 3- or 5-node clusters under LXC.

Steps to reproduce

You can reproduce this by cloning the test suite linked above, at commit 9e52d9, setting up a Jepsen environment, and running lein run test --rate 100 --time-limit 300 --nemesis kill --test-count 10 --sync-interval always. The problem usually manifests after just a few minutes.

Labels: defect (suspected defect such as a bug or regression), stale (this issue has had no activity in a while)
