Add retry functionality for failed model transfer jobs #2274
mturley wants to merge 9 commits into kubeflow:main
tested locally - when I clicked on edit resource name, the resource name field doesn't appear. (Attachment: Screen.Recording.2026-02-27.at.3.47.47.PM.mov)
    <Stack hasGutter>
      <StackItem>
        A new transfer job will be created for the{' '}
        <strong>{job.modelVersionName}</strong> model version.
I think, based on intent, we should change this <strong>{job.modelVersionName}</strong> to show the model name or the version name?
      ModalFooter,
    } from '@patternfly/react-core';
    import { ModelTransferJob } from '~/app/types';
    import ResourceNameDefinitionTooltip from '~/concepts/k8s/ResourceNameDefinitionTootip';
typo in the file name: /ResourceNameDefinitionTootip -> /ResourceNameDefinitionTooltip
The UpdateModelTransferJob function was not properly recovering metadata fields from the original job when creating a retry. This caused the retry job to lose important information like Author, Description, model format, and custom properties.

Added recovery of:
- Author and Description from job annotations
- VersionDescription, SourceModelFormat, SourceModelFormatVersion from ConfigMap
- ModelCustomProperties and VersionCustomProperties from ConfigMap JSON

Also added comprehensive tests to verify retry metadata preservation.

Fixes part of RHOAIENG-38267

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
Implements the frontend UI for retrying failed transfer jobs:
- Add RetryJobModal component with:
  - Auto-generated retry job name (adds -2, -3, etc. suffix)
  - Editable resource name with K8s validation
  - Checkbox to delete the failed job after retry
  - Error handling and loading states
- Add Retry button next to Failed status label in the jobs table
- Wire up retry API call to BFF PATCH endpoint with deleteOldJob param
- Update API types to support additional query parameters

Part of RHOAIENG-38267

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
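For readers following the "additional query parameters" item, here is a minimal sketch of how a retry URL carrying `deleteOldJob` could be assembled. The helper name `buildRetryUrl`, the base path, and the path layout are assumptions made for illustration; only the `deleteOldJob` parameter name and the use of PATCH come from the commit message.

```typescript
// Hypothetical helper, not the PR's actual code: builds the URL for a retry
// request. The path layout is an assumption; only the `deleteOldJob` query
// parameter name comes from the PR.
function buildRetryUrl(
  basePath: string,
  jobName: string,
  queryParams: Record<string, string | boolean | undefined> = {},
): string {
  const search = new URLSearchParams();
  for (const [key, value] of Object.entries(queryParams)) {
    // Skip undefined values so optional parameters never show up as "key=undefined".
    if (value !== undefined) {
      search.set(key, String(value));
    }
  }
  const query = search.toString();
  const path = `${basePath}/${encodeURIComponent(jobName)}`;
  return query ? `${path}?${query}` : path;
}
```

A caller could then pass the result to `fetch(url, { method: 'PATCH', ... })`, with the checkbox state supplying the `deleteOldJob` value.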
Tests cover:
- Modal rendering with correct title and description
- Auto-generated retry job name with -2 suffix
- Incrementing existing numeric suffixes
- Delete checkbox default state
- Form validation for K8s resource names
- Retry button click handling
- Error display on failure
- Loading state during retry
- Modal close on success

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
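The naming behaviour these tests describe (append -2, or increment an existing numeric suffix) can be sketched roughly as follows. The function name `nextRetryName` is hypothetical, and treating any trailing `-<digits>` as a retry counter is an assumption; the PR's actual implementation may differ, for example in how it handles length limits.

```typescript
// Hypothetical sketch of retry-name generation, not the PR's actual code.
// Assumption: any trailing "-<digits>" is treated as a retry counter.
function nextRetryName(name: string): string {
  const match = /^(.*)-(\d+)$/.exec(name);
  if (match) {
    const [, base, suffix] = match;
    // An existing numeric suffix is incremented: "job-2" becomes "job-3".
    return `${base}-${Number(suffix) + 1}`;
  }
  // No numeric suffix yet: the first retry gets "-2".
  return `${name}-2`;
}
```

One consequence of this simple rule is that a name ending in an unrelated number, such as a year, would also be incremented, so a real implementation might want a stricter convention.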
Update mockJob to include source, destination, and lastUpdateTimeSinceEpoch fields that are now required by the ModelTransferJob type.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

- Change modelFormatName to model_format_name
- Change modelFormatVersion to model_format_version
- Update test to use HavePrefix for secret names since GenerateName now adds random suffixes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

Renamed test from "preserves credentials by reusing secrets" to "clones credentials from old job into new secrets" to make it clear that we create NEW Secret objects (not reuse existing ones). This ensures that deleting the old job doesn't affect the new job's secrets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

Add "model version" after the version name in the modal description to match the design: "A new transfer job will be created for the <name> model version."

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
- Fix typo: ResourceNameDefinitionTootip → ResourceNameDefinitionTooltip
- Update modal description to be intent-based:
  - CREATE_MODEL: shows model name
  - CREATE_VERSION/UPDATE_ARTIFACT: shows version name and model name
- Add test for CREATE_VERSION intent description

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
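The intent-based description can be sketched as a small pure function. The intent names come from the commit message and `modelVersionName` from the reviewed snippet; the `modelName` field, the function name, and the exact wording of each sentence are assumptions for illustration only.

```typescript
// Hypothetical sketch of the intent-based modal description.
// Intent names are taken from the commit message; field names other than
// `modelVersionName` and the exact sentences are assumptions.
type TransferJobIntent = 'CREATE_MODEL' | 'CREATE_VERSION' | 'UPDATE_ARTIFACT';

interface RetryDescriptionInput {
  intent: TransferJobIntent;
  modelName: string;
  modelVersionName: string;
}

function describeRetryTarget(input: RetryDescriptionInput): string {
  const { intent, modelName, modelVersionName } = input;
  if (intent === 'CREATE_MODEL') {
    // A brand-new model: only the model name is shown.
    return `A new transfer job will be created for the ${modelName} model.`;
  }
  // CREATE_VERSION and UPDATE_ARTIFACT: show both the version and the model.
  return `A new transfer job will be created for the ${modelVersionName} version of the ${modelName} model.`;
}
```

Keeping the branching in a plain function like this makes each intent easy to cover with a unit test, independent of the modal's rendering.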
oops. at one point it was using the existing K8sNameDescriptionField component instead of trying to write its own, and at some AI iteration I failed to re-review it. AI growing pains, sorry. Fixing it
Replace custom form field implementation with the existing K8sNameDescriptionField component to avoid code duplication. This provides consistent UX for k8s resource name handling including auto-generation and validation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>
Description
This PR implements the retry functionality for failed model transfer jobs.
BFF Changes
Fixes the UpdateModelTransferJob function to properly preserve metadata when retrying a failed job:

Problem: When a user retries a failed transfer job, the existing implementation was not recovering important metadata fields from the original job, causing the retry to lose information like Author, Description, model format, and custom properties.

Solution: Added recovery logic to read metadata from:
- Job annotations: Author and Description
- ConfigMap: VersionDescription, SourceModelFormat, SourceModelFormatVersion
- ConfigMap JSON: ModelCustomProperties and VersionCustomProperties
Credential Cloning: When retrying a job without providing new credentials, the BFF clones the credentials from the old job's secrets into new Secret objects. This ensures that deleting the old job (via the "Delete failed job" checkbox) doesn't affect the new job's secrets.
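The cloning idea can be modelled in a few lines. The BFF itself is not written in TypeScript, so this is only a language-neutral sketch using simplified, hypothetical types (`SecretSketch`, `credentialsForRetry`); the `generateName` prefix derived from the new job name is also an assumption. The point it illustrates is that the new Secret gets its own copy of the data, so removing the old Secret leaves the retry untouched.

```typescript
// Simplified, hypothetical model of a Kubernetes Secret for illustration only.
interface SecretSketch {
  metadata: { name?: string; generateName?: string };
  data: Record<string, string>;
}

// Returns the Secret to create for the retry job.
// Assumption: caller-provided credentials win; otherwise the old data is copied.
function credentialsForRetry(
  provided: Record<string, string> | undefined,
  oldSecret: SecretSketch,
  newJobName: string,
): SecretSketch {
  return {
    // generateName lets the API server append a random suffix to the prefix.
    metadata: { generateName: `${newJobName}-` },
    // Copy the entries into a new object rather than sharing the old one.
    data: provided !== undefined ? { ...provided } : { ...oldSecret.data },
  };
}
```

Because the result never reuses the old Secret's name or its data object, the "Delete failed job" path can remove the original without invalidating the new job's credentials.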
Compatibility: Updated to use snake_case ConfigMap field names (model_format_name, model_format_version) per PR #2281.

Frontend Changes
Implements the UI for retrying failed transfer jobs:
- RetryJobModal component with:
  - Auto-generated retry job name (adds -2, -3, etc. suffix)
- Retry button next to the "Failed" status label in the jobs table (kebab menu has only Delete action)
- API integration with BFF PATCH endpoint supporting deleteOldJob query param

Screenshots
The retry button appears next to the Failed status label. Clicking it opens the retry modal where users can:
How Has This Been Tested?
BFF Tests:
- TestModelTransferJob retry metadata preservation with tests:
- make test in clients/ui/bff/

Frontend Tests:
- npm run test:unit (508 tests)

Merge criteria:
If you have UI changes
🤖 Generated with Claude Code