
KAFKA-19317: Refactor ShareConsumerTest::waitedPoll to work with multiple records. #19789


Open
wants to merge 6 commits into base: trunk

Conversation

ShivsundarR
Collaborator

@ShivsundarR ShivsundarR commented May 22, 2025

What
https://issues.apache.org/jira/browse/KAFKA-19317

  • One of the tests failed due to a possible race condition in
    waitedPoll(), where we expect 2 records but get only 1 record on the
    first poll(). This record wasn't acknowledged before the next poll(),
    which is not allowed when share.acknowledgement.mode is set to
    "explicit", hence the IllegalStateException was thrown.
    To fix this, I have refactored the test to produce and consume 1 record
    at a time, as that is more deterministic.

  • On digging into the reason for this flakiness, I noticed there might be
    a race condition in waitedPoll() where we might get some records on the
    first poll() and some on later calls. waitedPoll() does not cumulatively
    add up the records received across the different polls; it retries until
    a single poll() returns the exact number of records it expects. So when
    we expect more than 1 record in waitedPoll(), there is a chance of
    records getting split across polls if the ShareFetchBuffer does not have
    all the records yet. I have added a separate function which accumulates
    the records across polls and returns the result (see the sketch after
    this list), and used it for the tests that were calling waitedPoll()
    expecting multiple records.

  • Some of these modified tests use "explicit" mode, so we should ideally
    refactor them too to expect only 1 record at a time. These tests have
    clean runs in Develocity though, so I have not modified them in this PR.
    We can modify them if we observe flakiness in the future.
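For context, a rough sketch of what such an accumulating helper could look like, using the name suggested in the review below, the DEFAULT_MAX_WAIT_MS constant, and the TestUtils.waitForCondition overload already used by the existing helper; the exact signature and logic in the PR may differ:

private List<ConsumerRecord<byte[], byte[]>> waitedPollForMultipleRecords(ShareConsumer<byte[], byte[]> shareConsumer,
                                                                          long pollMs,
                                                                          int recordCount) throws InterruptedException {
    // Keep polling and collect records until the expected count is reached,
    // instead of requiring a single poll() to return all of them.
    List<ConsumerRecord<byte[], byte[]>> accumulated = new ArrayList<>();
    TestUtils.waitForCondition(() -> {
            ConsumerRecords<byte[], byte[]> records = shareConsumer.poll(Duration.ofMillis(pollMs));
            records.forEach(accumulated::add);
            return accumulated.size() == recordCount;
        },
        DEFAULT_MAX_WAIT_MS,
        500L,
        () -> "failed to get records");
    return accumulated;
}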

@github-actions github-actions bot added triage PRs from the community tests Test fixes (including flaky tests) clients small Small PRs labels May 22, 2025
@ShivsundarR ShivsundarR added ci-approved tests Test fixes (including flaky tests) KIP-932 Queues for Kafka clients and removed tests Test fixes (including flaky tests) clients small Small PRs triage PRs from the community labels May 22, 2025
Contributor

@apoorvmittal10 apoorvmittal10 left a comment


Thanks for the PR, minor comments.

Comment on lines 2988 to 2989
long pollMs,
int recordCount) {
Contributor

nit: needs indentation correction.

@@ -2977,6 +2984,28 @@ private ConsumerRecords<byte[], byte[]> waitedPoll(
return waitedPoll(shareConsumer, pollMs, recordCount, false, "", List.of());
}

private List<ConsumerRecord<byte[], byte[]>> waitedPollMultipleRecords(ShareConsumer<byte[], byte[]> shareConsumer,
Contributor

Suggested change
private List<ConsumerRecord<byte[], byte[]>> waitedPollMultipleRecords(ShareConsumer<byte[], byte[]> shareConsumer,
private List<ConsumerRecord<byte[], byte[]>> waitedPollForMultipleRecords(ShareConsumer<byte[], byte[]> shareConsumer,

},
DEFAULT_MAX_WAIT_MS,
500L,
() -> "failed to get records"
Contributor

Shall we log how many records were received vs needed? It will be easier to debug in the future.

Collaborator Author

Makes sense, I have changed the log line now. Thanks.
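For illustration, the final waitForCondition argument could become something like this (the wording is illustrative, not necessarily the exact message in the PR):

// Report received vs expected counts so a timeout failure is easy to diagnose.
() -> "Failed to get records: expected " + recordCount + ", received " + accumulated.size()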

@github-actions github-actions bot added the small Small PRs label May 23, 2025
@apoorvmittal10
Contributor

@ShivsundarR Please avoid force pushing once the PR is under review and has previous comments.

@ShivsundarR
Collaborator Author

ShivsundarR commented May 23, 2025

Hi @apoorvmittal10, yes, apologies. I was trying to fix the build failure by rebasing with master, and then realised I should have added a merge commit instead.
It seems the build failure is addressed in PR #19792.
I will add a merge commit once that PR is merged, which should fix the build failure.

@apoorvmittal10
Contributor

  • One of the tests failed due to a possible race condition in
    waitedPoll(), where we expect 2 records but get only 1 record on the
    first poll(). This record wasn't acknowledged before the next poll()

Are we not doing producer.flush() prior to reading? Then why do we still get a subset of the records in poll()?

I have added a separate function which accumulates the records across

Why not change the implementation of the existing waitedPoll method?

@ShivsundarR
Collaborator Author

We are doing a producer.flush(). I suspect it's a race condition around the ShareFetchBuffer, where both the background thread and the application thread are writing to and reading from it once the response is received.

java.lang.IllegalStateException: All records must be acknowledged in explicit acknowledgement mode. 2025-05-17T16:09:18.3314782Z at 

As we got this exception for the test, it means waitedPoll() is receiving partial records across multiple polls, and those records have not been acknowledged in explicit mode, hence the exception is thrown. So I deduced that for multiple records, waitedPoll() might not be deterministic, as it expects all records in a single poll().

Why not change the implementation of the existing waitedPoll method?

We could modify the existing function; I thought to keep waitedPoll() only for checking 1 record, and have a separate implementation for checking multiple records. As most of the tests in the suite only check for 1 record, they can use a straightforward implementation in waitedPoll(). Does that sound good?

@ShivsundarR
Collaborator Author

ShivsundarR commented May 23, 2025

Some of these modified tests use "explicit" mode, so we should ideally
refactor them too to expect only 1 record at a time. These tests have
clean runs in Develocity though, so I have not modified them in this PR.
We can modify them if we observe flakiness in the future.

There is also the problem that a few other tests expect more than 1 record in a poll using explicit mode. These are not flaky as of now, but we might need to modify them to test 1 record at a time to be deterministic.
If we introduce AcknowledgeType.RETAIN or change the default behaviour to not throw an exception in the future, then these tests are fine, but for now we can probably monitor and change them if required.
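For reference, the deterministic pattern for "explicit" mode looks roughly like this (a sketch: the producer/consumer variables, the topic partition tp, the 2500L poll timeout and the payloads are placeholders, not the PR's exact code):

for (int i = 0; i < recordCount; i++) {
    // Produce exactly one record and make sure it is written before polling.
    producer.send(new ProducerRecord<>(tp.topic(), tp.partition(), null, ("record-" + i).getBytes()));
    producer.flush();

    // In "explicit" mode every record returned by poll() must be acknowledged
    // before the next poll(), so consume and acknowledge one record at a time.
    ConsumerRecords<byte[], byte[]> records = waitedPoll(shareConsumer, 2500L, 1);
    for (ConsumerRecord<byte[], byte[]> record : records) {
        shareConsumer.acknowledge(record, AcknowledgeType.ACCEPT);
    }
    shareConsumer.commitSync();
}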

@apoorvmittal10
Contributor

I thought to keep waitedPoll() only for checking 1 record

The waitedPoll method also takes recordCount; shouldn't that parameter no longer be needed now?

@ShivsundarR
Collaborator Author

ShivsundarR commented May 23, 2025

Yes, it should not be needed now. I will update the code. Thanks.
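For illustration, the single-record variant could then drop the parameter and delegate to the existing overload with a fixed count of 1 (an assumption about the eventual shape, not the merged code):

private ConsumerRecords<byte[], byte[]> waitedPoll(ShareConsumer<byte[], byte[]> shareConsumer, long pollMs) {
    // Single-record wait: recordCount is always 1 for these callers.
    return waitedPoll(shareConsumer, pollMs, 1, false, "", List.of());
}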

@github-actions github-actions bot removed the small Small PRs label May 23, 2025
@ShivsundarR
Collaborator Author

I noticed @adixitconfluent's PR here, which adds a future.get() to produceAbortedTransaction(). This could explain the flakiness in testAlterReadUncommittedToReadCommittedIsolationLevel as well: if the abort transaction does not complete in time, we could possibly see these records in a different poll().
So that PR, which is now merged, should fix the problem, and we would get both records in a single poll().
This also explains why the other tests which use explicit mode and expect multiple records are not flaky.

Initially I thought this could be a race condition in the client, but that seems unlikely now. We can observe the builds on AK to see if the flakiness persists, and then close this PR if it is resolved. We can keep the PR open till then.
The refactors in waitedPoll() should also not be required, as we would ideally get all the produced records in a single poll().
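For context, the pattern described above, waiting on the send future so the transactional records are fully written before the transaction is aborted and the consumer polls, looks roughly like this (variable names and payload are illustrative, not the actual helper code):

producer.initTransactions();
producer.beginTransaction();
// Block until the record is written, so the aborted data is in place before the consumer polls.
Future<RecordMetadata> sendFuture = producer.send(new ProducerRecord<>(tp.topic(), tp.partition(), null, "aborted-record".getBytes()));
sendFuture.get();
producer.abortTransaction();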
