Hello,
We are facing an issue after upgrading from Streamiz v1.6.0 to v1.7.0. During a rebalance, our pods running on Kubernetes fail to correctly acquire their new partition assignments and become unresponsive. This behavior did not exist in v1.6.0.
Description
When a rebalance is triggered (e.g., by a pod starting or stopping, or sometimes after a consumer is fenced), the stream thread does not seem to handle the new partition assignment correctly, and we observe log messages with an empty task list, like this:
Partition assignment took (time in ms). Currently assigned active tasks:
instead of something like this:
Partition assignment took 00:00:00.0007995 ms. Currently assigned active tasks: 0-6
Following this, one or more pods enter an unresponsive state: they remain running but stop processing messages entirely, effectively becoming "zombies". It appears that the task assignment for the revoked/assigned partitions never completes, leaving the thread in a stalled state.
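To at least contain the symptom while this is investigated, a watchdog along the following lines could restart zombie pods via the Kubernetes liveness probe. This is only a rough sketch, not our actual code: the 10-minute threshold, the probe file path, and hooking it in through a Peek() step in the topology are all assumptions, and it obviously does not address the assignment problem itself.
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class ProcessingWatchdog
{
    private static long lastProcessedTicks = DateTime.UtcNow.Ticks;

    // Hypothetical hook point: call this from the topology, e.g.
    // stream.Peek((k, v) => ProcessingWatchdog.Touch()).
    public static void Touch() =>
        Interlocked.Exchange(ref lastProcessedTicks, DateTime.UtcNow.Ticks);

    // Keep refreshing a file whose age the Kubernetes liveness probe checks.
    // Blunt heuristic: a genuinely idle input topic would also trip it.
    public static async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            var silentFor = TimeSpan.FromTicks(
                DateTime.UtcNow.Ticks - Interlocked.Read(ref lastProcessedTicks));

            if (silentFor < TimeSpan.FromMinutes(10))   // assumed threshold
                File.WriteAllText("/tmp/healthy", DateTime.UtcNow.ToString("O"));

            await Task.Delay(TimeSpan.FromSeconds(30), token);
        }
    }
}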
Configuration used
Streamiz Version: 1.7.0 (upgraded from 1.6.0 where the issue was not present)
Confluent Kafka Version: The one used by Streamiz (2.6.1)
Deployment: Kubernetes (8 pods)
Processing Semantics: At-Least-Once
And the most important part of the config:
var streamConfig = new StreamConfig
{
    ApplicationId = "x-string", // Example ID
    BootstrapServers = "...",
    NumStreamThreads = 1,
    SecurityProtocol = SecurityProtocol.Ssl,
    MaxPollRecords = 100,
    MaxPollIntervalMs = 120000,
    SessionTimeoutMs = 30000,
    HeartbeatIntervalMs = 10000,
    PartitionAssignmentStrategy = PartitionAssignmentStrategy.CooperativeSticky,
    AutoOffsetReset = AutoOffsetReset.Earliest
};
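For reference, the stream itself is built and started in the usual way, roughly as below; the topic names and the pass-through step are placeholders for our actual topology.
var builder = new StreamBuilder();

builder.Stream<byte[], byte[]>("input-topic")   // ByteArraySerDes come from the default serdes
       .MapValues(v => v)                       // stand-in for our real processing
       .To("output-topic");

var stream = new KafkaStream(builder.Build(), streamConfig);
await stream.StartAsync();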
Stream property:
client.id:
num.stream.threads: 1
default.key.serdes: Streamiz.Kafka.Net.SerDes.ByteArraySerDes
default.value.serdes: Streamiz.Kafka.Net.SerDes.ByteArraySerDes
default.timestamp.extractor: Streamiz.Kafka.Net.Processors.Internal.FailOnInvalidTimestamp
commit.interval.ms: 30000
processing.guarantee: AT_LEAST_ONCE
transaction.timeout: 00:00:10
poll.ms: 100
max.poll.records: 100
max.poll.restoring.records: 1000
max.task.idle.ms: 0
buffered.records.per.partition: 2147483647
follow.metadata: True
replication.factor: -1
windowstore.changelog.additional.retention.ms: 86400000
offset.checkpoint.manager:
metrics.interval.ms: 30000
metrics.recording.level: INFO
log.processing.summary: 00:01:00
expose.librdkafka.stats: False
start.task.delay.ms: 5000
parallel.processing: False
max.degree.of.parallelism: 8
statestore.cache.max.bytes: 5242880
application.id: testid
Client property:
security.protocol: ssl
Consumer property:
max.poll.interval.ms: 120000
enable.auto.commit: False
enable.auto.offset.store: False
allow.auto.create.topics: False
partition.assignment.strategy: cooperative-sticky
auto.offset.reset: earliest
session.timeout.ms: 30000
heartbeat.interval.ms: 10000
Producer property:
allow.auto.create.topics: False
partitioner: murmur2_random
Admin client property:
allow.auto.create.topics: False
Notes
- On some occasions, this rebalance is preceded by a fencing message, but the core issue is the subsequent failure to recover.
Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group.
- We recognize that adjusting configuration values such as MaxPollIntervalMs, SessionTimeoutMs, HeartbeatIntervalMs, or even MaxPollRecords can make rebalances less likely. However, these are only workarounds. The core issue remains: when a rebalance does occur, the partition assignment logic fails, causing pods to become unresponsive.
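For completeness, this is the kind of tuning we mean; the values below are purely illustrative and only make rebalances (and fencing) less frequent, they do not fix the failed assignment:
// Mitigation only, not a fix: more headroom between polls, fewer records per poll,
// and a more tolerant session timeout. Example values, not a recommendation.
streamConfig.MaxPollRecords = 50;          // down from 100
streamConfig.MaxPollIntervalMs = 300000;   // up from 120000
streamConfig.SessionTimeoutMs = 45000;     // up from 30000
streamConfig.HeartbeatIntervalMs = 5000;   // kept well below session.timeout.ms / 3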