v1.7.0: Pods become unresponsive ('zombies') after a rebalance with CooperativeSticky strategy #435

@GuillermoTunjoS

Hello,

We are facing an issue after upgrading from Streamiz v1.6.0 to v1.7.0. During a rebalance, our pods running on Kubernetes fail to correctly acquire their new partition assignments and become unresponsive. This behavior did not exist in v1.6.0.

Description

When a rebalance is triggered (e.g., by a pod starting, stopping, or sometimes after a consumer is fenced), the stream thread does not seem to handle the new partition assignment correctly. We have been observing log messages like this, where the list of active tasks is empty:

Partition assignment took (time in ms) Currently assigned active tasks:

instead of something like this:

Partition assignment took 00:00:00.0007995 ms. Currently assigned active tasks: 0-6

After this message, one or more pods enter an unresponsive state. They remain running but stop processing messages entirely, effectively becoming "zombies". It appears that the task assignment for the revoked/assigned partitions does not complete successfully, leaving the thread stalled.

Configuration used

Streamiz Version: 1.7.0 (upgraded from 1.6.0 where the issue was not present)
Confluent Kafka Version: The one used by Streamiz (2.6.1)
Deployment: Kubernetes (8 pods)
Processing Semantics: At-Least-Once

The most relevant part of the config:

var streamConfig = new StreamConfig {
    ApplicationId = "x-string", // Example ID
    BootstrapServers = "...",
    NumStreamThreads = 1,
    SecurityProtocol = SecurityProtocol.Ssl,
    MaxPollRecords = 100,
    MaxPollIntervalMs = 120000,
    SessionTimeoutMs = 30000,
    HeartbeatIntervalMs = 10000,
    PartitionAssignmentStrategy = PartitionAssignmentStrategy.CooperativeSticky,
    AutoOffsetReset = AutoOffsetReset.Earliest
};

	Stream property:
		client.id: 	
		num.stream.threads: 	1
		default.key.serdes: 	Streamiz.Kafka.Net.SerDes.ByteArraySerDes
		default.value.serdes: 	Streamiz.Kafka.Net.SerDes.ByteArraySerDes
		default.timestamp.extractor: 	Streamiz.Kafka.Net.Processors.Internal.FailOnInvalidTimestamp
		commit.interval.ms: 	30000
		processing.guarantee: 	AT_LEAST_ONCE
		transaction.timeout: 	00:00:10
		poll.ms: 	100
		max.poll.records: 	100
		max.poll.restoring.records: 	1000
		max.task.idle.ms: 	0
		buffered.records.per.partition: 	2147483647
		follow.metadata: 	True
		replication.factor: 	-1
		windowstore.changelog.additional.retention.ms: 	86400000
		offset.checkpoint.manager: 	
		metrics.interval.ms: 	30000
		metrics.recording.level: 	INFO
		log.processing.summary: 	00:01:00
		expose.librdkafka.stats: 	False
		start.task.delay.ms: 	5000
		parallel.processing: 	False
		max.degree.of.parallelism: 	8
		statestore.cache.max.bytes: 	5242880
		application.id: testid
	Client property:
		security.protocol: 	ssl
	Consumer property:
		max.poll.interval.ms: 	120000
		enable.auto.commit: 	False
		enable.auto.offset.store: 	False
		allow.auto.create.topics: 	False
		partition.assignment.strategy: 	cooperative-sticky
		auto.offset.reset: 	earliest
		session.timeout.ms: 	30000
		heartbeat.interval.ms: 	10000
	Producer property:
		allow.auto.create.topics: 	False
		partitioner: 	murmur2_random
	Admin client property:
		allow.auto.create.topics: 	False

Notes

  1. On some occasions, this rebalance is preceded by a fencing message, but the core issue is the subsequent failure to recover.

Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group.

  2. We recognize that adjusting configuration values such as MaxPollIntervalMs, SessionTimeoutMs, HeartbeatIntervalMs, or even MaxPollRecords can make rebalances less likely. However, these are only workarounds. The core issue remains: when a rebalance does occur, the partition assignment logic fails, causing pods to become unresponsive.
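
For reference, a minimal sketch of the kind of tuning meant by the workaround above, reusing only the StreamConfig properties already shown in this issue. The specific values are illustrative assumptions, not tested recommendations, and the name tunedConfig is hypothetical. At best this makes rebalances rarer; it does not address the assignment failure itself.

```csharp
using Confluent.Kafka;
using Streamiz.Kafka.Net;

// Illustrative values only (assumptions, not validated against this bug).
var tunedConfig = new StreamConfig {
    ApplicationId = "x-string", // Example ID
    BootstrapServers = "...",
    NumStreamThreads = 1,
    SecurityProtocol = SecurityProtocol.Ssl,
    MaxPollRecords = 50,          // smaller batches, so each poll loop iteration finishes sooner
    MaxPollIntervalMs = 300000,   // more headroom before the consumer is considered failed
    SessionTimeoutMs = 45000,     // tolerate longer gaps between heartbeats
    HeartbeatIntervalMs = 15000,  // kept at one third of the session timeout, as in the original config
    PartitionAssignmentStrategy = PartitionAssignmentStrategy.CooperativeSticky,
    AutoOffsetReset = AutoOffsetReset.Earliest
};
```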

Labels: bug
