-
Notifications
You must be signed in to change notification settings - Fork 1.2k
workload repair can recover from corrupt workload sets
#52434
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release/10.0.1xx
Are you sure you want to change the base?
workload repair can recover from corrupt workload sets
#52434
Conversation
Plan: (this commit is step 1 of the plan but idk if it works completely yet) - Workload Manifest Reader will always fail with `ManifestFromWorkloadSetNotFound` when calling GetManifests. I considered removing the workload set entirely during workload set search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction. Removing the error is not sufficient because then we never recover from the bad state or detect the bad state. So, 1. Add a function to repair so it can repair missing manifests, because before it called ReinstallWorkloadsBasedOnCurrentManifests which calls install workloads, which calls install packs, which needs to read the manifests to know the packs, so repair would fail. 2. Once repair is able to fix the issue, resolve the other commands to be able to fix the issue by A - detecting corrupt issue, and then calling the repair subcomponent to install missing manifests 3. Make sure repair isn't cyclical and doesn't try to call the sub repair option (should be ok, because we don't get to the missing manifests code until we've already repaired the manifests) 4. Add boolean for manifest reader or whatever to print during info, version or list that there was a corrupt workload set likely from pkg manager but still allow repairing it (or decide if we don't want to do that.)
workload restore is not setup for the injection so didn't add it there but the code is shared between the two so testing like this should be decent
…ommand so it can be shared
…rkload command so it can be shared" This reverts commit a3959ba.
… also fixed this prevents needing to inherit or modify many command files at once
workload history recorder tries to get manifests and it does this before file baesd installer is created.
…b.com/nagilson/sdk into nagilson-recover-manifests-on-install
…ble to repair the manifests - the repair family commands should be able repair
the install state is empty on disk - this is not safe to rely on. Also if we refresh manifests we will create a dupe manifest
src/Cli/dotnet/Commands/Workload/WorkloadManifestCorruptionRepairer.cs
Outdated
Show resolved
Hide resolved
workload x can recover from corrupt workload setsworkload repair can recover from corrupt workload sets
nagilson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some polish needed if this isnt urgent:
- The error string should likely contain the workload id and not just the version
- We should try to dedupe the message if possible, despite separate assemblies
- The MissingManifests logic is also implemented in two approaches, which has potential for consolidation
- It may not be required to attach in the factory method and we should reconsider if we can do that at the IInstaller level
- The sdkFeatureBand parameter is unnecessary in the CorruptionRepairer
|
The test failure seems to be due to a change in the new, more specific error message presenting in some cases instead of an existing error message. |
|
I tested in codespaces. I can confirm that if I delete any workload manifest (after installing), I will get errors during --info, workload --info, and build. I can also confirm that repair, restore, and update all fix the problem. The error experiences themselves could probably be cleaned up a bit as they give full exception stacks in 2/3 of those flows (workload --info is the best Ux). But this is sufficient for unblocking the customer. Need to fix the tests and review. |
| // Licensed to the .NET Foundation under one or more agreements. | ||
| // The .NET Foundation licenses this file to you under the MIT license. | ||
|
|
||
| #nullable disable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Probably shouldn't be making new files with #nullable disable.
…test the E2E but I couldn't make that work.
context
for #52090
copilot failed attempt #52338
Issue: The Workload Manifest Reader can fail with
ManifestFromWorkloadSetNotFoundwhen callingGetManifestswhen package managers clean up manifests from older sdk versions that were still needed by the newer version workloads.❗ This PR is better reviewed with hiding whitespace changes.
Manual Testing Loop Demo
What's this do
I considered removing the workload set entirely during workload manifest search if the manifests didn't exist and it was corrupt but this would prevent the repair from working and still require user interaction.
Removing the error is not sufficient because then we never recover from the bad state or detect the bad state.
So, now we have a separate class which tries to repair missing manifests before we start the manifest lookup. The errors for missing manifests are disabled in the recorder because the recorder will fail to initialize if the manifests are corrupted and it doesn't have the data available to repair or recover from that state. We do the sub-manifest repair because the manifests need to be in a good state for the rest of repair to have the correct data to function. The repairer being hooked into the Installer means commands such as repair will get the hook to do so, but commands such as
--infonever initialize an installer through the factory, so they instead print a prettier error.Notes
I tested and confirmed:
repair, restore, update are fixed.
--info, workload --info, workload list show the small error.