GC hangs indefinitely on LongRunning #115794

Open

cleardarkz opened this issue May 20, 2025 · 17 comments · May be fixed by #115892
Labels
area-GC-coreclr · needs-area-label (An area label is needed to ensure this gets routed to the appropriate area owners) · untriaged (New issue has not been triaged by the area owner)

Comments

@cleardarkz

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Sometimes when running a background worker for a lengthy period of time, it appears that the GC tries to run a collection round and suspends all threads, but becomes deadlocked while trying to free resources.

We have experienced this behavior on long-running processes, usually after a couple of hours or more. I will add that the codebase uses GetAwaiter.Result() extensively; however, it appears that even in this case there should be some sort of fallback for when the GC fails to run.

I have attached a dump diagram of the latest occurrence. I will be able to provide more logs/dumps as required.

An attempt to reproduce the issue with a sample repo produced similar behavior, but the memory dumps do not look the same; therefore the sample project is not attached to this report.

Expected Behavior

A resource allocation exception, a process-level exception, any sort of error that would resume the .NET internals.

Steps To Reproduce

No response

Exceptions (if any)

No exceptions; the process hangs until manually terminated (forceful shutdown).

.NET Version

8.0.409

Anything else?

[Image: dump diagram of the latest occurrence]

@github-actions bot added the needs-area-label label May 20, 2025
@martincostello
Member

Looks like this issue should be transferred to dotnet/runtime as it doesn't appear to be specific to ASP.NET Core.

@BrennanConroy BrennanConroy transferred this issue from dotnet/aspnetcore May 20, 2025
@dotnet-policy-service bot added the untriaged label May 20, 2025
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9
Member

mangod9 commented May 21, 2025

@cleardarkz, can you please provide a repro for this issue or a dump when it's deadlocked?

@cleardarkz
Author

@cleardarkz, can you please provide a repro for this issue or a dump when it's deadlocked?

Absolutely,

Here's a Google Drive link:
https://drive.google.com/file/d/1_7KcDgkSuqIA9MFpe_kqkxXE3bbdL9b7/view?usp=sharing

@mangod9
Member

mangod9 commented May 21, 2025

Is this a NativeAOT app? I notice that ThreadSuspension appears to be blocked by this reverse pInvoke stack. @VSadov have you seen anything like this before?

 # Call Site
00 ntdll!ZwWaitForSingleObject
01 KERNELBASE!WaitForSingleObjectEx
02 Cymulate_Agent_Service!Thread::RareDisablePreemptiveGC
03 Cymulate_Agent_Service!JIT_ReversePInvokeEnterRare2
04 Cymulate_Agent_Service!JIT_ReversePInvokeEnterTrackTransitions
05 0x0
06 user32!UserCallWinProcCheckWow
07 user32!DispatchClientMessage
08 user32!__fnINDEVICECHANGE
09 ntdll!KiUserCallbackDispatcherContinue
0a win32u!ZwUserGetMessage
0b user32!GetMessageW
0c 0x0
0d 0x0
0e Cymulate_Agent_Service!CallDescrWorkerInternal
0f Cymulate_Agent_Service!CallDescrWorkerWithHandler
10 Cymulate_Agent_Service!DispatchCallSimple
11 Cymulate_Agent_Service!ThreadNative::KickOffThread_Worker
12 Cymulate_Agent_Service!ManagedThreadBase_DispatchInner
13 Cymulate_Agent_Service!ManagedThreadBase_DispatchMiddle
14 Cymulate_Agent_Service!ManagedThreadBase_DispatchOuter
15 Cymulate_Agent_Service!ManagedThreadBase_FullTransition
16 Cymulate_Agent_Service!ManagedThreadBase::KickOff
17 Cymulate_Agent_Service!ThreadNative::KickOffThread
18 kernel32!BaseThreadInitThunk
19 ntdll!RtlUserThreadStart

@VSadov
Member

VSadov commented May 21, 2025

Is this a NativeAOT app? I notice that ThreadSuspension appears to be blocked by this reverse pInvoke stack. @VSadov have you seen anything like this before?

This is a normal stack for a thread that is trying to enter managed code while the EE is suspended, most likely for a GC. In such a case the thread will block and wait for the EE to resume.

@VSadov
Member

VSadov commented May 21, 2025

From the screenshot it looks like

  • it is CoreCLR, not NativeAot (because PulseAll is in native code)
  • one thread tries to initiate a GC, so it starts suspending all managed threads
  • almost all threads are suspended, but not all
  • the suspending thread is waiting (with a timeout) for more progress. If the wait times out, it checks whether the remaining threads are stopped or can be forced to stop via hijacking, interrupts, etc.
    Ultimately it is possible for a thread to be in a state where it cannot be suspended. Such states should be transient and should not take long.
  • one thread is running ObjectNative::PulseAll. If it runs in COOP mode it would be in a "can't suspend" state and would prevent the EE suspension from completing.

If the PulseAll thread is somehow stuck, EE cannot finish suspending and GC cannot start.

I am not sure how PulseAll can get stuck though - it just walks a linked list of waiters and sets every waiter's event.

The waiter list is bounded by the number of threads in the program, so it should not be long. Perhaps it somehow got corrupted and became circular?
Or maybe the Set on the event gets stuck - not sure if that can happen.
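
For illustration, here is a minimal, self-contained sketch (using a generic stand-in node type, not the runtime's actual waiter queue) of why a dequeue loop over a singly linked list never terminates once a node's next pointer refers back to the node itself:

#include <cstdio>

// Hypothetical stand-in for a waiter node; the real runtime type is different.
struct WaitLink
{
    WaitLink* m_pNext;
};

// Dequeue the node after the sentinel head, mirroring a "walk and unlink" pattern.
WaitLink* DequeueWaiter(WaitLink* pHead)
{
    WaitLink* pWaiter = pHead->m_pNext;
    if (pWaiter != nullptr)
        pHead->m_pNext = pWaiter->m_pNext; // unlink the dequeued node
    return pWaiter;
}

int main()
{
    WaitLink nodeB = { nullptr };
    WaitLink nodeA = { &nodeB };
    WaitLink head  = { &nodeA };

    // Simulate the reported corruption: nodeA's next pointer refers back to nodeA.
    nodeA.m_pNext = &nodeA;

    // A PulseAll-style drain loop now dequeues nodeA forever: after unlinking,
    // head.m_pNext points at nodeA again, so the loop condition never becomes null.
    int iterations = 0;
    WaitLink* pWaiter;
    while ((pWaiter = DequeueWaiter(&head)) != nullptr)
    {
        if (++iterations > 5) // bail out so this demo itself terminates
        {
            std::printf("still dequeuing the same node after %d iterations\n", iterations);
            break;
        }
    }
    return 0;
}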

@cymulateagentteam

@VSadov I have reproduced the issue and debugged it live.

The singly linked list had somehow turned into a circular list, which is definitely an issue.
The memory wasn't otherwise corrupted; the node's ->m_pNext value pointed back to the same node. See the following:

[Image: debugger view showing the node's m_pNext pointing back to the same node]

psb->m_Link.m_pNext = pLink->m_pNext;

causing the following loop in PulseAll to effectively deadlock the GC collection procedure:

while ((pWaitEventLink = ThreadQueue::DequeueThread(this)) != NULL)

I managed to break the deadlock at runtime by manually resetting the ->m_pNext value to NULL, restoring the expected SList behavior.

I believe the root cause might be manipulation of the SList from code that is not protected by a critical section.

There could be an easy mitigation for this issue: add the following before L240:

if (psb->m_Link.m_pNext == pLink->m_pNext)
    psb->m_Link.m_pNext = NULL;

cymulateagentteam pushed a commit to cymulateagentteam/runtime that referenced this issue May 22, 2025
… is somehow corrupted into a circular list, causing the engine to enter thread suspension procedure and never exit - Issue dotnet#115794
cymulateagentteam added a commit to cymulateagentteam/runtime that referenced this issue May 22, 2025
@mangod9
Member

mangod9 commented May 22, 2025

Is there a standalone repro for this? We should properly investigate and fix the root cause instead of taking the proposed fix as-is.

@cleardarkz
Author

cleardarkz commented May 22, 2025

Is there a standalone repro for this? We should properly investigate and fix the root cause instead of taking the proposed fix as-is.

Investigating the root cause might be a lengthy process, as there could be many reasons why this happens.

The proposed fix is clean and simple; if the condition in the proposed fix is met, you can safely assume that the process is deadlocked.

It's also worth noting that after releasing the deadlock manually by setting ->m_pNext to NULL, the process resumed gracefully without any further hiccups, as if it had never been deadlocked.

@jkotas
Member

jkotas commented May 22, 2025

We cannot accept changes to silently ignore unexpected data corruptions for security reasons.

It would be ok to check for the data corruption and fail the process immediately. It would replace the hang with a crash.

@VSadov
Member

VSadov commented May 22, 2025

I think all changes to the list happen while the corresponding lock is held, so how could the list become circular?

I wonder if it is possible to insert a circularity check/failfast at places where a waiter is added or removed from the queue and run the repro?
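
For illustration, such a check could look roughly like the sketch below; WaitLink is a hypothetical stand-in for the real waiter node, and std::abort stands in for the runtime's failfast machinery:

#include <cstdlib>

// Hypothetical stand-in for a waiter node.
struct WaitLink
{
    WaitLink* m_pNext;
};

// Floyd cycle detection over the waiter list; called (hypothetically) right after
// a node is enqueued or dequeued. Fails fast instead of hanging later in PulseAll.
void AssertNoCycle(WaitLink* pHead)
{
    WaitLink* pSlow = pHead;
    WaitLink* pFast = pHead;
    while (pFast != nullptr && pFast->m_pNext != nullptr)
    {
        pSlow = pSlow->m_pNext;           // advances one node per step
        pFast = pFast->m_pNext->m_pNext;  // advances two nodes per step
        if (pSlow == pFast)
        {
            // In the real runtime this would be a failfast with a distinct error
            // code, so the corruption surfaces as a crash rather than a hang.
            std::abort();
        }
    }
}

int main()
{
    WaitLink nodeB = { nullptr };
    WaitLink nodeA = { &nodeB };
    AssertNoCycle(&nodeA);   // healthy list: no effect

    nodeA.m_pNext = &nodeA;  // simulate the corruption seen in the dump
    AssertNoCycle(&nodeA);   // aborts here instead of letting PulseAll spin forever
    return 0;
}

The check is O(n) per call, which should be acceptable given that the list is bounded by the number of waiting threads.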

@cleardarkz
Author

We cannot accept changes to silently ignore unexpected data corruptions for security reasons.

It would be ok to check for the data corruption and fail the process immediately. It would replace the hang with a crash.

Understood, we're merely here to provide all the insights we can gather and help find a solution to the issue, if one exists.

@jkotas
Member

jkotas commented May 23, 2025

Potentially related: #97034 (another mysterious PulseAll issue)

@jkotas
Member

jkotas commented May 23, 2025

The corrupted linked list lives on thread stacks, i.e., one thread is modifying the stack of another thread. It is an unusual data structure. We may be hitting a corner-case hardware issue with write-back of stack memory. The linked list corruption can be explained by a write-back of stack memory being issued twice, which is harmless nearly 100% of the time otherwise.

Here is a delta that we can try to test this theory: jkotas@3b340c1 . Could you please give it a try and let us know if you still hit the hang? InterlockedExchange should give a hint to the hardware to avoid delayed write backs.
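
As an illustration only (this is not the actual delta), the shape of such a change is roughly the following; WaitLink is a hypothetical stand-in, and in the runtime source the field is reached via psb->m_Link.m_pNext:

#include <windows.h>

// Hypothetical stand-in for the waiter node.
struct WaitLink
{
    WaitLink* volatile m_pNext;
};

void UnlinkNext(WaitLink* psb, WaitLink* pLink)
{
    // Plain store (roughly what the code does today); the write-back to the other
    // thread's stack memory is left entirely to the compiler and hardware:
    //     psb->m_pNext = pLink->m_pNext;

    // Interlocked store: a full-barrier atomic exchange, which publishes the new
    // value immediately instead of leaving a delayed write-back in flight.
    InterlockedExchangePointer(
        reinterpret_cast<PVOID volatile*>(&psb->m_pNext),
        pLink->m_pNext);
}

int main()
{
    WaitLink nodeB = { nullptr };
    WaitLink nodeA = { &nodeB };
    WaitLink head  = { &nodeA };
    UnlinkNext(&head, &nodeA); // head now links past nodeA to nodeB
    return 0;
}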

@cymulateagentteam

The corrupted linked list lives on thread stacks, i.e., one thread is modifying the stack of another thread. It is an unusual data structure. We may be hitting a corner-case hardware issue with write-back of stack memory. The linked list corruption can be explained by a write-back of stack memory being issued twice, which is harmless nearly 100% of the time otherwise.

Here is a delta that we can try to test this theory: jkotas@3b340c1 . Could you please give it a try and let us know if you still hit the hang? InterlockedExchange should give a hint to the hardware to avoid delayed write backs.

I am positive this is going to work; however, I am not entirely sure we can definitively pin this on hardware-caused delayed write-backs.
If we assume this theory is correct and the issue is write-back of stack memory, we simply cannot trust anything in the operating system to be consistent.

Also, this issue has reproduced specifically with this list far too many times for it to be statistically probable that it is suffering specifically from delayed write-backs (although I do believe everything is possible).

If I had to guess, I would lean towards an insufficient locking mechanism (incorrect lock objects? a missing lock?) or uninitialized fields that create a once-in-a-while race condition where the list may be corrupted, provided the right stars align.

I think performing an InterlockedExchange would address insufficient locking in this case as well as delayed write-backs.
I suppose we could give it a go; however, the issue reproduces once every couple of days at best, and sometimes it takes weeks.

Another insight: if delayed write-backs could happen this often, I would expect a LOT of processes to break over this behavior. For instance, Windows uses critical sections in the NT loader to synchronize linked lists across threads for proper module initialization and usage; these kinds of swaps in linked lists are very common in the NT loader implementation (fundamentally the backbone of Windows). I would expect the lists in the NT loader to become corrupted long before the lists in a process' .NET runtime, and I would bet a single .NET process' GC runs less frequently than the NT loader machinery across every Windows process.

@jkotas
Member

jkotas commented May 25, 2025

Windows uses critical sections in the NT loader to synchronize linked lists across threads for proper module initialization and usage; these kinds of swaps in linked lists are very common in the NT loader implementation (fundamentally the backbone of Windows)

This linked list is not an ordinary linked list accessed by multiple threads. Ordinary linked lists accessed by multiple threads are allocated on the heap.

This linked list is allocated on the stacks of multiple threads. One thread has pointers into the stacks of other threads and uses those pointers to read and write stack memory on those threads. This is very unusual. I cannot think of other examples where code uses a data structure like that. Are you aware of any (e.g. in the NT loader)?

I would lean towards an insufficient locking mechanism (incorrect lock objects? a missing lock?) or uninitialized fields that create a once-in-a-while race condition

I would lean towards this as well. However, I am not able to find evidence of anything like that in the crash dumps.
