GC hangs indefinitely on LongRunning #115794

Open

cleardarkz opened this issue May 20, 2025 · 17 comments · May be fixed by #115892
Labels
area-GC-coreclr · needs-area-label (An area label is needed to ensure this gets routed to the appropriate area owners) · untriaged (New issue has not been triaged by the area owner)

Comments

@cleardarkz

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

Sometimes when running a background worker for a lengthy period of time, it appears that the GC tries to run a collection round and suspends all threads, but becomes deadlocked while trying to free resources.

We have experienced this behavior on long-running processes, usually after a couple of hours or more. I will add that the codebase uses GetAwaiter.Result() extensively; however, it appears that even in this case there should be some sort of fallback for when the GC fails to run.

I have attached a dump diagram of the latest occurrence. I will be able to provide more logs/dumps as required.

An attempt to reproduce the issue with a sample repo produced similar behavior, but the memory dumps do not look the same; therefore the sample project is not attached to this report.

Expected Behavior

A resource allocation exception, a process-level exception, any sort of error that would resume the .NET internals.

Steps To Reproduce

No response

Exceptions (if any)

No exceptions; the process hangs until manually terminated (forceful shutdown).

.NET Version

8.0.409

Anything else?

[Image: dump diagram of the latest occurrence]

@github-actions bot added the needs-area-label label May 20, 2025
@martincostello
Member

Looks like this issue should be transferred to dotnet/runtime as it doesn't appear to be specific to ASP.NET Core.

@BrennanConroy BrennanConroy transferred this issue from dotnet/aspnetcore May 20, 2025
@dotnet-policy-service bot added the untriaged label May 20, 2025
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9
Member

mangod9 commented May 21, 2025

@cleardarkz, can you please provide a repro for this issue or a dump when it's deadlocked?

@cleardarkz
Author

@cleardarkz, can you please provide a repro for this issue or a dump when it's deadlocked?

Absolutely,

Here's a Google Drive link:
https://drive.google.com/file/d/1_7KcDgkSuqIA9MFpe_kqkxXE3bbdL9b7/view?usp=sharing

@mangod9
Member

mangod9 commented May 21, 2025

Is this a NativeAOT app? I notice that ThreadSuspension appears to be blocked by this reverse pInvoke stack. @VSadov have you seen anything like this before?

 # Call Site
00 ntdll!ZwWaitForSingleObject
01 KERNELBASE!WaitForSingleObjectEx
02 Cymulate_Agent_Service!Thread::RareDisablePreemptiveGC
03 Cymulate_Agent_Service!JIT_ReversePInvokeEnterRare2
04 Cymulate_Agent_Service!JIT_ReversePInvokeEnterTrackTransitions
05 0x0
06 user32!UserCallWinProcCheckWow
07 user32!DispatchClientMessage
08 user32!__fnINDEVICECHANGE
09 ntdll!KiUserCallbackDispatcherContinue
0a win32u!ZwUserGetMessage
0b user32!GetMessageW
0c 0x0
0d 0x0
0e Cymulate_Agent_Service!CallDescrWorkerInternal
0f Cymulate_Agent_Service!CallDescrWorkerWithHandler
10 Cymulate_Agent_Service!DispatchCallSimple
11 Cymulate_Agent_Service!ThreadNative::KickOffThread_Worker
12 Cymulate_Agent_Service!ManagedThreadBase_DispatchInner
13 Cymulate_Agent_Service!ManagedThreadBase_DispatchMiddle
14 Cymulate_Agent_Service!ManagedThreadBase_DispatchOuter
15 Cymulate_Agent_Service!ManagedThreadBase_FullTransition
16 Cymulate_Agent_Service!ManagedThreadBase::KickOff
17 Cymulate_Agent_Service!ThreadNative::KickOffThread
18 kernel32!BaseThreadInitThunk
19 ntdll!RtlUserThreadStart

@VSadov
Member

VSadov commented May 21, 2025

Is this a NativeAOT app? I notice that ThreadSuspension appears to be blocked by this reverse pInvoke stack. @VSadov have you seen anything like this before?

This is a normal stack for a thread that is trying to enter managed code while the EE is suspended, most likely for a GC. In such a case the thread will block and wait for the EE to resume.

@VSadov
Member

VSadov commented May 21, 2025

From the screenshot it looks like

  • it is CoreCLR, not NativeAot (because PulseAll is in native code)
  • one thread tries to initiate a GC, so it starts suspending all managed threads
  • almost all threads are suspended, but not all
  • the suspending thread is waiting (with a timeout) for more progress. If the wait times out, it checks whether the remaining threads are stopped or can be forced to stop via hijacking, interrupts, etc.
    Ultimately it is possible for a thread to be in a state where it cannot be suspended. Such states should be transient and should not take long.
  • one thread is running ObjectNative::PulseAll. If it runs in COOP mode it would be in a "can't suspend" state and would prevent the EE suspension from completing.

If the PulseAll thread is somehow stuck, EE cannot finish suspending and GC cannot start.

I am not sure how PulseAll can get stuck though - it just walks a linked list of waiters and sets every waiter's event.

The waiter list is bounded by the number of threads in the program, so it should not be long. Perhaps it somehow got corrupted and became circular?
Or maybe the Set on the event gets stuck - not sure if that can happen.
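
For illustration, here is a minimal, self-contained sketch (using a generic stand-in node type, not the runtime's actual waiter queue) of why a dequeue loop over a singly linked list never terminates once a node's next pointer refers back to the node itself:

#include <cstdio>

// Hypothetical stand-in for a waiter node; the real runtime type is different.
struct WaitLink
{
    WaitLink* m_pNext;
};

// Dequeue the node after the sentinel head, mirroring a "walk and unlink" pattern.
WaitLink* DequeueWaiter(WaitLink* pHead)
{
    WaitLink* pWaiter = pHead->m_pNext;
    if (pWaiter != nullptr)
        pHead->m_pNext = pWaiter->m_pNext; // unlink the dequeued node
    return pWaiter;
}

int main()
{
    WaitLink nodeB = { nullptr };
    WaitLink nodeA = { &nodeB };
    WaitLink head  = { &nodeA };

    // Simulate the reported corruption: nodeA's next pointer refers back to nodeA.
    nodeA.m_pNext = &nodeA;

    // A PulseAll-style drain loop now dequeues nodeA forever: after unlinking,
    // head.m_pNext points at nodeA again, so the loop condition never becomes null.
    int iterations = 0;
    WaitLink* pWaiter;
    while ((pWaiter = DequeueWaiter(&head)) != nullptr)
    {
        if (++iterations > 5) // bail out so this demo itself terminates
        {
            std::printf("still dequeuing the same node after %d iterations\n", iterations);
            break;
        }
    }
    return 0;
}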

@cymulateagentteam

@VSadov I have reproduced the issue and debugged it live.

The singly linked list had somehow turned into a circular list, which is definitely an issue.
The memory wasn't otherwise corrupted; the node's ->m_pNext value pointed back to the same node. See the following:

[Image: debugger view showing the node's m_pNext pointing back to the same node]

psb->m_Link.m_pNext = pLink->m_pNext;

causing the following loop in PulseAll to effectively deadlock the GC collection procedure:

while ((pWaitEventLink = ThreadQueue::DequeueThread(this)) != NULL)

I managed to break the deadlock at runtime by manually resetting the ->m_pNext value to NULL, restoring the expected SList behavior.

I believe the root cause might be manipulation of the SList from code that is not protected by a critical section.

There could be an easy mitigation for this issue: add the following before L240:

if (psb->m_Link.m_pNext == pLink->m_pNext)
    psb->m_Link.m_pNext = NULL;

cymulateagentteam pushed a commit to cymulateagentteam/runtime that referenced this issue May 22, 2025
… is somehow corrupted into a circular list, causing the engine to enter thread suspension procedure and never exit - Issue dotnet#115794
cymulateagentteam added a commit to cymulateagentteam/runtime that referenced this issue May 22, 2025
@mangod9
Member

mangod9 commented May 22, 2025

Is there a standalone repro for this? We should properly investigate and fix the root cause instead of taking the proposed fix as-is.

@cleardarkz
Author

cleardarkz commented May 22, 2025

Is there a standalone repro for this? We should properly investigate and fix the root cause instead of taking the proposed fix as-is.

Investigating the root cause might be a lengthy process, as there could be many reasons why this happens.

The proposed fix is clean and simple; if the condition in the proposed fix is met, you can safely assume that the process is deadlocked.

It's also worth noting that after releasing the deadlock manually by setting ->m_pNext to NULL, the process resumed gracefully without any further hiccups, as if it had never been deadlocked.

@jkotas
Member

jkotas commented May 22, 2025

We cannot accept changes to silently ignore unexpected data corruptions for security reasons.

It would be ok to check for the data corruption and fail the process immediately. It would replace the hang with a crash.

@VSadov
Member

VSadov commented May 22, 2025

I think all changes to the list happen while the corresponding lock is held, so how could the list become circular?

I wonder if it is possible to insert a circularity check/failfast at places where a waiter is added or removed from the queue and run the repro?
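
For illustration, such a check could look roughly like the sketch below; WaitLink is a hypothetical stand-in for the real waiter node, and std::abort stands in for the runtime's failfast machinery:

#include <cstdlib>

// Hypothetical stand-in for a waiter node.
struct WaitLink
{
    WaitLink* m_pNext;
};

// Floyd cycle detection over the waiter list; called (hypothetically) right after
// a node is enqueued or dequeued. Fails fast instead of hanging later in PulseAll.
void AssertNoCycle(WaitLink* pHead)
{
    WaitLink* pSlow = pHead;
    WaitLink* pFast = pHead;
    while (pFast != nullptr && pFast->m_pNext != nullptr)
    {
        pSlow = pSlow->m_pNext;           // advances one node per step
        pFast = pFast->m_pNext->m_pNext;  // advances two nodes per step
        if (pSlow == pFast)
        {
            // In the real runtime this would be a failfast with a distinct error
            // code, so the corruption surfaces as a crash rather than a hang.
            std::abort();
        }
    }
}

int main()
{
    WaitLink nodeB = { nullptr };
    WaitLink nodeA = { &nodeB };
    AssertNoCycle(&nodeA);   // healthy list: no effect

    nodeA.m_pNext = &nodeA;  // simulate the corruption seen in the dump
    AssertNoCycle(&nodeA);   // aborts here instead of letting PulseAll spin forever
    return 0;
}

The check is O(n) per call, which should be acceptable given that the list is bounded by the number of waiting threads.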

@cleardarkz
Author

We cannot accept changes to silently ignore unexpected data corruptions for security reasons.

It would be ok to check for the data corruption and fail the process immediately. It would replace the hang with a crash.

Understood, we're merely here to provide all the insights we can gather and help find a solution to the issue, if one exists.

@jkotas
Member

jkotas commented May 23, 2025

Potentially related: #97034 (another mysterious PulseAll issue)

@jkotas
Member

jkotas commented May 23, 2025

The corrupted linked list lives on thread stacks, i.e., one thread is modifying the stack of another thread. It is an unusual data structure. We may be hitting a corner-case hardware issue with write-back of stack memory. The linked list corruption can be explained by a write-back of stack memory being issued twice, which is harmless nearly 100% of the time otherwise.

Here is a delta that we can try to test this theory: jkotas@3b340c1 . Could you please give it a try and let us know if you still hit the hang? InterlockedExchange should give a hint to the hardware to avoid delayed write backs.
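
As an illustration only (this is not the actual delta), the shape of such a change is roughly the following; WaitLink is a hypothetical stand-in, and in the runtime source the field is reached via psb->m_Link.m_pNext:

#include <windows.h>

// Hypothetical stand-in for the waiter node.
struct WaitLink
{
    WaitLink* volatile m_pNext;
};

void UnlinkNext(WaitLink* psb, WaitLink* pLink)
{
    // Plain store (roughly what the code does today); the write-back to the other
    // thread's stack memory is left entirely to the compiler and hardware:
    //     psb->m_pNext = pLink->m_pNext;

    // Interlocked store: a full-barrier atomic exchange, which publishes the new
    // value immediately instead of leaving a delayed write-back in flight.
    InterlockedExchangePointer(
        reinterpret_cast<PVOID volatile*>(&psb->m_pNext),
        pLink->m_pNext);
}

int main()
{
    WaitLink nodeB = { nullptr };
    WaitLink nodeA = { &nodeB };
    WaitLink head  = { &nodeA };
    UnlinkNext(&head, &nodeA); // head now links past nodeA to nodeB
    return 0;
}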

@cymulateagentteam

The corrupted linked list lives on thread stacks, i.e., one thread is modifying the stack of another thread. It is an unusual data structure. We may be hitting a corner-case hardware issue with write-back of stack memory. The linked list corruption can be explained by a write-back of stack memory being issued twice, which is harmless nearly 100% of the time otherwise.

Here is a delta that we can try to test this theory: jkotas@3b340c1 . Could you please give it a try and let us know if you still hit the hang? InterlockedExchange should give a hint to the hardware to avoid delayed write backs.

I am positive this is going to work; however, I am not entirely sure we can definitively pin this on hardware-caused delayed write-backs.
If we assume this theory is correct and the issue is write-back of stack memory, we simply cannot trust anything in the operating system to be consistent.

Also, this issue has reproduced specifically with this list far too many times for it to be statistically probable that it is suffering specifically from delayed write-backs (although I do believe everything is possible).

If I had to guess, I would lean towards an insufficient locking mechanism (incorrect lock objects? a missing lock?) or uninitialized fields that create a once-in-a-while race condition where the list may be corrupted, provided the right stars align.

I think performing an InterlockedExchange would address insufficient locking in this case as well as delayed write-backs.
I suppose we could give it a go; however, the issue reproduces once every couple of days at best, and sometimes it takes weeks.

Another insight: if delayed write-backs could happen this often, I would expect a LOT of processes to break over this behavior. For instance, Windows uses critical sections in the NT loader to synchronize linked lists across threads for proper module initialization and usage; these kinds of swaps in linked lists are very common in the NT loader implementation (fundamentally the backbone of Windows). I would expect the lists in the NT loader to become corrupted long before the lists in a process' .NET runtime, and I would bet a single .NET process' GC runs less frequently than the NT loader machinery across every Windows process.

@jkotas
Member

jkotas commented May 25, 2025

Windows uses critical sections in the NT loader to synchronize linked lists across threads for proper module initialization and usage; these kinds of swaps in linked lists are very common in the NT loader implementation (fundamentally the backbone of Windows)

This linked list is not an ordinary linked list accessed by multiple threads. Ordinary linked lists accessed by multiple threads are allocated on the heap.

This linked list is allocated on the stacks of multiple threads. One thread has pointers into the stacks of other threads and uses those pointers to read and write stack memory on those threads. This is very unusual. I cannot think of other examples where code uses a data structure like that. Are you aware of any (e.g. in the NT loader)?

I would lean towards an insufficient locking mechanism (incorrect lock objects? a missing lock?) or uninitialized fields that create a once-in-a-while race condition

I would lean towards this as well. However, I am not able to find evidence of anything like that in the crash dumps.
