GC hangs indefinitely on LongRunning #115794

Comments
Looks like this issue should be transferred to dotnet/runtime, as it doesn't appear to be specific to ASP.NET Core.
Tagging subscribers to this area: @dotnet/gc
@cleardarkz, can you please provide a repro for this issue, or a dump from when it's deadlocked?
Absolutely, here's a Google Drive link:
Is this a NativeAOT app? I notice that thread suspension appears to be blocked by this reverse P/Invoke stack. @VSadov, have you seen anything like this before?
This is a normal stack for a thread that is trying to enter managed code while the EE is suspended, most likely for a GC. In that case the thread blocks and waits for the EE to resume.
From the screenshot, it looks like the PulseAll thread is stuck. If the PulseAll thread is somehow stuck, the EE cannot finish suspending and the GC cannot start. I am not sure how PulseAll can get stuck, though: it just walks a linked list of waiters and sets each waiter's event. The waiter list is bounded by the number of threads in the program, so it should not be long. Perhaps it somehow got corrupted and became circular?
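The failure mode described above can be modeled with a hedged sketch. The node type below is an illustrative stand-in for the runtime's `WaitEventLink`, not the actual `syncblk.h` definition: once the chain loops back on itself, a plain walk like PulseAll's never reaches the null terminator, while a Floyd tortoise-and-hare check detects the cycle in bounded time.

```cpp
#include <cassert>

// Illustrative stand-in for the runtime's per-waiter node; the real
// WaitEventLink also carries the waiter's event handle.
struct WaitEventLink {
    WaitEventLink* m_Next = nullptr;
};

// Floyd's tortoise-and-hare cycle check: if the chain loops back on
// itself (the corruption discussed above), a naive traversal such as
// PulseAll's spins forever; this detects that condition instead.
bool HasCycle(const WaitEventLink* head) {
    const WaitEventLink* slow = head;
    const WaitEventLink* fast = head;
    while (fast != nullptr && fast->m_Next != nullptr) {
        slow = slow->m_Next;             // advances one node per step
        fast = fast->m_Next->m_Next;     // advances two nodes per step
        if (slow == fast) {
            return true;                 // the pointers met inside a loop
        }
    }
    return false;                        // chain terminates normally
}
```

A traversal that instead counts nodes against the thread count would work too, since the list is bounded by the number of threads in the process.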
@VSadov I have reproduced the issue and debugged it live. The singly linked list had somehow turned into a circular list, which is definitely an issue (runtime/src/coreclr/vm/syncblk.cpp, line 240 at commit 7e7e195), causing the subsequent PulseAll to effectively deadlock the GC collection procedure (runtime/src/coreclr/vm/syncblk.cpp, line 2877 at commit 7e7e195).
I have managed to unclog the deadlock at runtime by manually resetting the ->m_pNext value to NULL, restoring the expected SList behavior. I believe the root cause might be manipulation of the SList from code that is not protected by a critical section. There could be an easy mitigation for this issue, with the following before L240:
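The proposed snippet itself was not captured in this thread, so the following is only a guess at its shape, with hypothetical names rather than the actual syncblk.cpp identifiers: a guard that spots a self-referencing ->m_pNext before the unlink and resets it, mirroring the manual fix applied in the debugger.

```cpp
// Hypothetical sketch only -- the proposed snippet is missing from this
// thread, and the names below are illustrative, not syncblk.cpp's.
struct WaitEventLink {
    WaitEventLink* m_pNext = nullptr;
};

// If the node's next pointer loops straight back to the node itself,
// reset it to null, as was done manually in the debugger. Returns true
// when corruption was found and repaired.
bool BreakSelfLink(WaitEventLink* pLink) {
    if (pLink != nullptr && pLink->m_pNext == pLink) {
        pLink->m_pNext = nullptr;
        return true;
    }
    return false;
}
```

Note this only covers the simplest self-loop; a cycle running through several nodes would need a full traversal check to detect.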
Referenced in a commit: "… is somehow corrupted into a circular list, causing the engine to enter thread suspension procedure and never exit" - Issue dotnet#115794
Is there a standalone repro for this? We should properly investigate and fix the root cause instead of taking the proposed fix as-is.
Investigating the root cause might be a lengthy process, as there could be many reasons why this happens. The proposed fix is clean and simple: if the condition in the proposed fix is met, you can safely assume that the process is deadlocked. It's also worth noting that after releasing the deadlock manually by setting ->m_pNext to NULL, the process resumed gracefully without any further hiccups, as if it had never deadlocked.
We cannot accept changes that silently ignore unexpected data corruption, for security reasons. It would be OK to check for the data corruption and fail the process immediately; that would replace the hang with a crash.
I think all changes to the list happen while the corresponding lock is held, so how could the list become circular? I wonder if it is possible to insert a circularity check/fail-fast at the places where a waiter is added to or removed from the queue, and run the repro?
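A hedged sketch of that suggestion, where the names and the bound are assumptions: since the waiter list holds at most one node per thread in the process, a traversal that exceeds any plausible thread count implies a cycle, and the add/remove sites can fail fast instead of letting PulseAll hang.

```cpp
#include <cstdlib>

// Illustrative node type, not the runtime's actual WaitEventLink.
struct WaitEventLink {
    WaitEventLink* m_Next = nullptr;
};

// Assumed upper bound on list length; the real list holds at most one
// node per thread, so any walk longer than this indicates a cycle.
constexpr int kMaxWaiters = 1 << 20;

// Bounded traversal: returns false if the chain fails to terminate
// within the bound, i.e. it is almost certainly circular.
bool IsWaiterChainValid(const WaitEventLink* head) {
    int count = 0;
    for (const WaitEventLink* p = head; p != nullptr; p = p->m_Next) {
        if (++count > kMaxWaiters) {
            return false;
        }
    }
    return true;
}

// At each add/remove site, the check would crash rather than hang:
void ValidateOrFailFast(const WaitEventLink* head) {
    if (!IsWaiterChainValid(head)) {
        std::abort();  // stand-in for the runtime's fail-fast path
    }
}
```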
Understood, we're merely here to provide all the insights we can gather and help find a solution to the issue, if one exists. |
Potentially related: #97034 (another mysterious PulseAll issue)
The corrupted linked list lives on thread stacks, i.e. one thread is modifying the stack of another thread. It is an unusual data structure. We may be hitting a corner-case hardware issue with write-back of stack memory. The linked list corruption can be explained by a write-back of stack memory being issued twice, which is harmless nearly 100% of the time otherwise. Here is a delta that we can try to test this theory: jkotas@3b340c1. Could you please give it a try and let us know if you still hit the hang? InterlockedExchange should give the hardware a hint to avoid delayed write-backs.
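The linked delta is not reproduced here, but as a portable illustration of the idea (with assumed names): replacing a plain pointer store with an interlocked exchange turns the write into a full atomic read-modify-write, which the hardware must order and write back promptly.

```cpp
#include <atomic>

// Illustrative node type; plain `WaitEventLink*` stores to the shared
// link are replaced by atomic operations.
struct WaitEventLink {
    std::atomic<WaitEventLink*> m_Next{nullptr};
};

// Portable analogue of InterlockedExchangePointer(&link->m_Next, value):
// atomically swaps in the new link and returns the previous one.
WaitEventLink* ExchangeNext(WaitEventLink* link, WaitEventLink* value) {
    return link->m_Next.exchange(value);
}
```

Unlike a relaxed store, the exchange is sequentially consistent by default, so it also papers over any missing-lock race on this field, which is relevant to the alternative theory discussed in this thread.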
I am positive this is going to work; however, I am not entirely sure we can definitively pin this on hardware-caused delayed write-backs. Also, this issue has reproduced with this specific list far too many times for it to be statistically probable that it alone suffers from delayed write-backs (although I do believe everything is possible). If I had to guess, I would lean towards an insufficient locking mechanism (incorrect lock objects? a missing lock?) or uninitialized fields that create a once-in-a-while race condition where the list may be corrupted, provided the right stars align. I think using InterlockedExchange would address insufficient locking in this case as well as delayed write-backs.

Another insight: by assuming delayed write-backs happen this often, I would expect a LOT of processes to break over this behavior. For instance, Windows uses critical sections in the NT loader to synchronize linked lists across threads for proper module initialization and usage, and these kinds of swaps in linked lists are very common in the NT loader implementation (fundamentally the backbone of Windows). I would expect the lists in the NT loader to corrupt well before the lists in a single process's .NET runtime; I would bet a single .NET process's GC runs less frequently than the NT loader runs across every Windows process.
This linked list is not an ordinary linked list accessed by multiple threads. Ordinary linked lists accessed by multiple threads are allocated on the heap. This linked list is allocated on the stacks of multiple threads: one thread holds pointers into the stacks of other threads and uses those pointers to read and write stack memory on those threads. This is very unusual, and I cannot think of other examples where code uses a data structure like that. Are you aware of any (e.g. in the NT loader)?
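To make the unusual shape concrete, here is a simplified hypothetical model (not the runtime's code): each waiting thread links a node that lives on its own stack into a shared chain, and the "pulsing" side then follows pointers into other threads' stacks while those threads are parked.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

struct StackNode {
    StackNode* next = nullptr;
};

// Spawns nThreads waiters, each of which links a node allocated on its
// OWN stack into a shared chain, then counts the chain from the other
// side -- i.e. walks pointers into other threads' stack memory.
int RunDemo(int nThreads) {
    std::mutex lock;
    StackNode* head = nullptr;
    std::atomic<int> registered{0};
    std::atomic<bool> release{false};

    std::vector<std::thread> threads;
    for (int i = 0; i < nThreads; ++i) {
        threads.emplace_back([&] {
            StackNode node;  // lives on this waiter's stack
            {
                std::lock_guard<std::mutex> g(lock);
                node.next = head;
                head = &node;
            }
            registered.fetch_add(1);
            // Park until "pulsed"; the stack frame (and node) stays alive.
            while (!release.load()) std::this_thread::yield();
        });
    }
    while (registered.load() < nThreads) std::this_thread::yield();

    int count = 0;  // this thread now reads other threads' stacks
    {
        std::lock_guard<std::mutex> g(lock);
        for (StackNode* p = head; p != nullptr; p = p->next) ++count;
    }
    release.store(true);
    for (auto& t : threads) t.join();
    return count;
}
```

The nodes are only valid while their owning frames are alive, which is why the walk happens before the waiters are released; the runtime's version has the same lifetime constraint.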
I would lean towards this as well. However, I am not able to find evidence of anything like that in the crash dumps. |
Is there an existing issue for this?
Describe the bug
Sometimes, when running a background worker for a lengthy period of time, the GC appears to start a collection round and suspend all threads, but then deadlocks while trying to free resources.
We have experienced this behavior on long-running processes, usually after a couple of hours or more. I will add that the codebase uses GetAwaiter().GetResult() extensively; however, even in this case there should be some sort of fallback for when the GC fails to run.
I have attached a dump diagram of the latest occurrence. I will be able to provide more logs/dumps as required.
Attempting to reproduce the issue with a sample repo reproduced the behavior, but the memory dumps do not look the same; therefore the sample project is not attached to the report.
Expected Behavior
A resource allocation exception, a process exception, any sort of error that would resume the .NET internals.
Steps To Reproduce
No response
Exceptions (if any)
No exceptions; the process hangs until manually freed (forceful shutdown).
.NET Version
8.0.409
Anything else?