fix(jobs): close prerun-clobber + orphan-NATS-dispatch races
Review on #1324 surfaced two races that left the early-guard non-functional
in production:
1. ``task_prerun`` (``pre_update_job_status``) wrote PENDING to the row
before the ``run_job`` body inspected status. A canceled or redelivered
message therefore had its REVOKED/CANCELING overwritten with PENDING,
and the early-guard added in the parent commit never tripped. The
existing tests passed only because they invoked ``run_job.apply(args=[…])``
while production uses ``kwargs={"job_id": …}``. Under ``args``, the prerun
handler raised ``KeyError`` and exited silently, so it never wrote PENDING.
The tests now use ``kwargs=``, which reproduces the production code path,
and the prerun handler short-circuits when ``Job.is_settled()`` is true,
preserving the status the early-guard reads next.
2. For ASYNC_API jobs ``Job.cancel()`` revokes without ``terminate=True``,
marks the row REVOKED, and tears down the NATS stream + Redis state.
``MLJob.run`` running in a worker that's still inside ``collect_images``
(slow for large collections) would then proceed to ``queue_images_to_nats``
and recreate the stream the cancel just deleted, dispatching real GPU
work to ADC for a revoked job; the results came back to no Redis state
and ``_fail_job`` silently overwrote REVOKED with FAILURE. The bootstrap
now checks ``Job.status`` (via a values-only read so the in-memory
``progress`` mutations don't clobber the cancel's REVOKED) right after
the collect stage and bails out before any dispatch.
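
The prerun clobber in item 1 can be reproduced without Celery or Django. The sketch below is a plain-Python simulation, not the project's handler: the dict-backed ``store``, the function names, and the membership of ``SETTLED`` are assumptions made for illustration.

```python
# Plain-Python simulation of the prerun-then-guard chain.
# `store` is a hypothetical stand-in for the Job table: {job_id: status}.

# Assumed membership; the real predicate lives in Job.is_settled().
SETTLED = frozenset({"REVOKED", "CANCELING", "SUCCESS", "FAILURE"})


def prerun_unguarded(store, kwargs):
    """Old behaviour: always reset the row to PENDING."""
    store[kwargs["job_id"]] = "PENDING"


def prerun_guarded(store, kwargs):
    """New behaviour: leave a settled row untouched."""
    job_id = kwargs["job_id"]
    if store[job_id] in SETTLED:
        return
    store[job_id] = "PENDING"


def early_guard(store, job_id):
    """Return True when the task body should refuse to run."""
    return store[job_id] in SETTLED


if __name__ == "__main__":
    for prerun in (prerun_unguarded, prerun_guarded):
        store = {1: "REVOKED"}
        prerun(store, {"job_id": 1})  # kwargs-style call, as in production
        print(prerun.__name__, store[1], early_guard(store, 1))
```

Calling either handler with an empty ``kwargs`` dict, which is what an ``args=[…]`` invocation amounts to in this model, raises ``KeyError`` before any write, so the row keeps its status and the guard appears to work.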
Adds ``Job.is_settled()`` to centralize the "terminal or being torn down"
predicate that ``run_job``'s early-guard, the prerun handler, ``_fail_job``,
and the bootstrap guard all needed. Adds two regression tests: one for the
prerun-then-guard chain, one for the cancel-during-bootstrap race.
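
A minimal sketch of the predicate and the bootstrap guard, under stated assumptions: the ``JobState`` members, the contents of ``SETTLED_STATES``, the dict-backed ``db``, and the ``bootstrap`` signature are hypothetical and stand in for the real model and ``MLJob.run``.

```python
from enum import Enum


class JobState(str, Enum):
    # Assumed set of states; only some are named in this commit message.
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    CANCELING = "CANCELING"
    REVOKED = "REVOKED"


# "Terminal or being torn down" (assumed membership).
SETTLED_STATES = frozenset(
    {JobState.SUCCESS, JobState.FAILURE, JobState.CANCELING, JobState.REVOKED}
)


def is_settled(status):
    """True when no further work should start for a job in this status."""
    return JobState(status) in SETTLED_STATES


def bootstrap(db, job_id, collect, dispatch):
    """Collect, re-check the stored status, and dispatch only if still live.

    `db` is a hypothetical {job_id: {"status": ...}} mapping. Reading just
    the status value mimics a values-only query: nothing is written back,
    so a concurrent cancel's REVOKED is not overwritten.
    """
    images = collect()  # may be slow; a cancel can land while this runs
    if is_settled(db[job_id]["status"]):
        return False  # bail out before any dispatch
    dispatch(images)
    return True
```

In the cancel-during-bootstrap case, ``collect`` returns after the row has already become REVOKED, ``bootstrap`` returns ``False``, and ``dispatch`` is never called.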
Co-Authored-By: Claude <noreply@anthropic.com>