fix(experiment): propagate fresh ResourceVersion after spec rollback#2930
fix(experiment): propagate fresh ResourceVersion after spec rollback#2930
Conversation
After a timeout/stopped rollback, the spec Update bumped the DDA ResourceVersion but the caller's in-memory instance kept the stale RV, causing the status write to 409 and defer the terminal phase to the next reconcile. On that next reconcile the spec already matched a freshly-timestamped, rollback-annotated revision, so findMostRecent returned it, elapsed < timeout, and the terminal phase was never persisted — a race that made integration tests flaky on CI. restorePreviousSpec/rollback now take *DatadogAgent and update instance.ResourceVersion after the client.Update (or match), so the status write in the same reconcile succeeds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: acd2abafde
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Propagate the post-Update ResourceVersion so the downstream status write | ||
| // in the same reconcile uses the fresh RV and the terminal phase is | ||
| // persisted without requiring a follow-up reconcile. | ||
| instance.ResourceVersion = toUpdate.ResourceVersion |
There was a problem hiding this comment.
Preserve status concurrency when refreshing the RV
When another status writer updates this DatadogAgent after the reconcile read but before this rollback update (for example RC changing .status.experiment), copying the post-update ResourceVersion onto the stale in-memory object makes updateStatusIfNeededV2 later deep-copy and Status().Update the old status without getting the conflict that would normally protect that concurrent change. Please merge the terminal experiment phase into the freshly fetched status or use a targeted status patch instead of only transplanting the RV.
Useful? React with 👍 / 👎.
This comment has been minimized.
This comment has been minimized.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2930 +/- ##
==========================================
+ Coverage 40.69% 40.71% +0.01%
==========================================
Files 321 321
Lines 28413 28420 +7
==========================================
+ Hits 11563 11570 +7
Misses 16015 16015
Partials 835 835
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
Replace the ResourceVersion transplant with a targeted status subresource merge patch that only writes the terminal experiment phase. This addresses the codex review finding: propagating the post-Update ResourceVersion onto the caller's stale in-memory instance caused updateStatusIfNeededV2 to silently overwrite any concurrent status writes made between our initial fetch and our spec Update (e.g. RC daemon updating status.experiment). The targeted patch lands the terminal phase in the same reconcile as the rollback (fixing the original flake) while leaving the full-status write's 409 protection intact for every other status field. The patch is best-effort — a transient failure just defers the phase write to the next reconcile, same as before. Adds Test_Experiment_RollbackPersistsInSingleReconcile regression guard asserting both the spec rollback and phase=timeout persist in a single reconcile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rollback() no longer mutates ResourceVersion, so it doesn't need the typed pointer — revert to the original metav1.ObjectMeta signature for a more minimal diff. restorePreviousSpec keeps the *DatadogAgent signature since it needs the typed object for Status().Patch(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 851f223cbe
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // reconcile will re-compute and retry. | ||
| func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, phase v2alpha1.ExperimentPhase) { | ||
| patch := fmt.Appendf(nil, `{"status":{"experiment":{"phase":%q}}}`, phase) | ||
| if err := r.client.Status().Patch(ctx, instance, client.RawPatch(types.MergePatchType, patch)); err != nil { |
There was a problem hiding this comment.
Avoid refreshing the status RV before the full status update
When this raw status patch succeeds it decodes the server response back into instance, advancing instance.ResourceVersion even though newStatus was computed from the object fetched at the start of reconcile. If Fleet Automation/RC updates any status field (for example a new experiment ID or phase) between that initial fetch and this patch, the later updateStatusIfNeededV2 full Status().Update will use the patched RV and replace status with stale newStatus instead of 409'ing as before, clobbering the concurrent status write. Patch into a scratch object or refetch/merge before the full status update so other status fields keep their conflict protection.
Useful? React with 👍 / 👎.
controller-runtime's Status().Patch decodes the server response back into the passed object, advancing ResourceVersion. Calling it on the caller's *DatadogAgent instance would let the subsequent full-status write in updateStatusIfNeededV2 use the patched RV and silently overwrite any status field that a concurrent writer (e.g. the RC daemon) updated between our initial reconcile fetch and this patch. Patch into a DeepCopy scratch so instance.ResourceVersion stays at the value observed at reconcile start. The full-status write's 409 protection is retained, and the terminal experiment phase still lands in the same reconcile via the targeted patch. Adds Test_Experiment_RollbackPreservesConcurrentStatusWrite regression guard asserting instance.ResourceVersion is not mutated by restorePreviousSpec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b2aa3b359
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, phase v2alpha1.ExperimentPhase) { | ||
| patch := fmt.Appendf(nil, `{"status":{"experiment":{"phase":%q}}}`, phase) | ||
| scratch := instance.DeepCopy() | ||
| if err := r.client.Status().Patch(ctx, scratch, client.RawPatch(types.MergePatchType, patch)); err != nil { |
There was a problem hiding this comment.
Do not overwrite concurrent experiment phase updates
When the RC daemon accepts a promote/stop after this reconcile fetched instance but before this patch runs, this raw merge patch still writes status.experiment.phase from the stale decision without any resourceVersion/phase precondition. For example, pkg/fleet/experiment.go allows promote from running and pkg/fleet/daemon.go persists phase=promoted; this patch can then change that accepted promotion back to timeout/rollback, whereas the previous full status update would have conflicted and retried. That can leave Fleet believing the experiment was promoted while the DDA status/spec reflect a rollback.
Useful? React with 👍 / 👎.
The raw merge patch wrote status.experiment.phase without any precondition. If a concurrent writer (e.g. the RC daemon accepting a promotion at pkg/fleet/daemon.go:354) moved the phase between the reconcile's initial fetch and the restorePreviousSpec patch, the merge patch would silently overwrite their accepted state with our stale terminal-phase decision — potentially flipping an accepted promotion back to timeout/rollback and leaving Fleet's view inconsistent with the DDA. Switch to a JSON patch with a `test` op that asserts the server's current phase still matches the phase the rollback decision was based on. If it doesn't, the test fails, the patch is rejected, and the concurrent writer's state survives — mirroring the 409-and-retry semantics of a full-status Update. Adds Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite regression guard; fixes TestRestorePreviousSpec_PhaseSetOnlyOnSuccess to populate instance.Status.Experiment (required by the new observed- phase capture in restorePreviousSpec). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0deaf5e48d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return err | ||
| } | ||
| newStatus.Experiment.Phase = terminalPhase | ||
| r.patchExperimentPhase(ctx, instance, observedPhase, terminalPhase) |
There was a problem hiding this comment.
Annotate rollback revision before persisting terminal phase
When the operator process exits after this patch succeeds but before the annotateRevision call below runs, the terminal phase has already been persisted. On restart, handleRollback ignores timeout/rollback, so the highest experiment revision never gets annotationExperimentRollback; re-applying the same spec later can hit the stale CreationTimestamp path and immediately time out. Persist the phase only after the annotation attempt, or make terminal-phase reconciles repair a missing rollback annotation.
Useful? React with 👍 / 👎.
The previous order was: patch terminal phase → annotate experiment revision. If the operator crashed between these two writes, the server held a terminal phase but the revision was unannotated. On restart, handleRollback ignores terminal phases (the switch only handles Stopped/Running), so the annotation would never be applied — a later re-apply of the same spec would then hit the stale CreationTimestamp path in findMostRecentMatchingRevision and fire an immediate false timeout. Swap to annotate-then-patch. If a crash happens between them now, the phase is still non-terminal on restart, handleRollback re-fires, annotateRevision is idempotent (early-returns when already annotated), and patchExperimentPhase runs again. Eventually both writes succeed. Also extends Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite to assert the annotation lands regardless of whether the phase patch was rejected by the test op — the annotation is protection for future re-applies and must not be gated on phase-write success. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0dadba1e3c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Best-effort: any error here (test failure or transient) is logged at V(1) | ||
| // and the next reconcile will re-compute and retry. | ||
| func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, expectedCurrent, newPhase v2alpha1.ExperimentPhase) { | ||
| patch := fmt.Appendf(nil, `[{"op":"test","path":"/status/experiment/phase","value":%q},{"op":"replace","path":"/status/experiment/phase","value":%q}]`, expectedCurrent, newPhase) |
There was a problem hiding this comment.
Guard phase patch with experiment identity
When Fleet moves the same DDA through another transition that returns to the same phase before this patch runs—for example promoting the timed-out experiment and then starting a new one, which records a different status.experiment.id with phase=running—this test still passes because it only checks phase. The raw patch then marks the newer experiment timeout/rollback, whereas the previous RV-protected status update would have conflicted instead of overwriting a newer status; include the observed experiment ID (or the whole experiment status) in the patch precondition.
Useful? React with 👍 / 👎.
What does this PR do?
Persists the terminal experiment phase (
timeout/rollback) in the same reconcile as the spec rollback, via a targetedStatus().Patch()guarded by a JSON patchtestop onstatus.experiment.phase. This eliminates a cross-reconcile race where the terminal phase could be silently dropped, without compromising concurrent-write safety or crash-recovery of the rollback annotation.Motivation
Test_Experiment_StateTransitionssubtests (rollback_then_timeout,promoted_then_timeout, with/withoutstale_old_revision) were flaking on CI — failing withexpected=timeout, actual=runningatcontroller_experiment_integration_test.go:947. Reproduced locally (5+ failures in 20 runs).Root cause: reconcile 1 of the timeout path updated spec (bumping
ResourceVersion) but the full-status write used the stale RV → 409 →phase=runningpersisted. On reconcile 2, the refetched spec already matched a freshly-timestamped, rollback-annotated revision, sofindMostRecentMatchingRevisionreturned it,elapsed < timeout, and the terminal phase was never written.Fix
restorePreviousSpecnow writes the terminal phase via a targeted status subresource JSON patch (patchExperimentPhase) after the spec rollback. The patch is narrow (onlystatus.experiment.phase) and carries atestop that asserts the server's phase still matches the phase the rollback decision was based on. Four safety properties are preserved:phase=timeout/phase=rollbackpersists without waiting for a follow-up reconcile.pkg/fleet/daemon.go:354writingphase=promoted) moves the phase between our reconcile fetch and our patch, thetestop fails and the patch is rejected — our stale decision does not clobber their accepted state. Mirrors the 409-and-retry semantics of a full-status Update.instance.DeepCopy()scratch so the client's response decoding does not advance the caller'sinstance.ResourceVersion. The downstream full-status write inupdateStatusIfNeededV2therefore retains its own 409 protection for every other status field.annotateRevisionruns BEFOREpatchExperimentPhase. If the operator exits between the two writes, the phase is still non-terminal on restart,handleRollbackre-fires,annotateRevisionis idempotent, and both writes eventually succeed. Reversing this order would leave a terminal phase without the rollback annotation — and sincehandleRollbackignores terminal phases, the annotation would never be applied, causing a later re-apply of the same spec to fire an immediate false timeout on the staleCreationTimestamp.The patch is best-effort — any error (test op failure or transient) is logged at V(1) and the next reconcile re-computes and retries.
Additional Notes
restorePreviousSpecsignature change: now takes*v2alpha1.DatadogAgentinstead ofmetav1.ObjectMeta(needed forStatus().Patch()).rollbacksignature unchanged (metav1.ObjectMeta).TestRestorePreviousSpec_PhaseSetOnlyOnSuccesswas updated to populateinstance.Status.Experiment(the new observed-phase capture inrestorePreviousSpecdereferences it — safe in production becausehandleRollbackguards against nil before delegating, but needed for this direct unit-test call).Test_Experiment_RollbackPersistsInSingleReconcile— asserts spec rollback andphase=timeoutboth persist in a single reconcile.Test_Experiment_RollbackPreservesConcurrentStatusWrite— assertsrestorePreviousSpecdoes not mutate the caller'sResourceVersion(bisected: reverting the scratch-copy fails this test).Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite— asserts a concurrentphase=promotedwrite survives a stalephase=rollbackdecision, AND that the rollback annotation still lands even when the phase patch is rejected (bisected: reverting to an unconditional merge patch fails the first assertion).Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
Describe your test plan
go test ./internal/controller/datadogagent/ -run Test_Experiment_StateTransitions -count=20— multiple subtest failures before fix.Test_Experiment_StateTransitions + Test_Experiment_RollbackPersistsInSingleReconcile + Test_Experiment_RollbackPreservesConcurrentStatusWrite + Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite-count=20— all green.Test_Experiment_|TestManageExperiment|TestHandleRollback|TestRollback|TestRestorePreviousSpec|TestReapplySameSpecAfterRollback)-count=10— clean.internal/controller/datadogagentpackage — clean.Checklist
bugv1.26.0(alsoqa/skip-qa— test-infrastructure stability, no user-facing change)🤖 Generated with Claude Code