
Add retry functionality for failed model transfer jobs #2274

Open
mturley wants to merge 9 commits into kubeflow:main from mturley:RHOAIENG-27992-retry-job

@mturley
Contributor

@mturley mturley commented Feb 24, 2026

Description

This PR implements the retry functionality for failed model transfer jobs.

Screenshot 2026-02-26 at 3 18 41 PM Screenshot 2026-02-26 at 3 18 46 PM

BFF Changes

Fixes the UpdateModelTransferJob function to properly preserve metadata when retrying a failed job:

Problem: When a user retries a failed transfer job, the existing implementation was not recovering important metadata fields from the original job, causing the retry to lose information like Author, Description, model format, and custom properties.

Solution: Added recovery logic to read metadata from:

  • Job annotations (Author, Description)
  • ConfigMap data (VersionDescription, SourceModelFormat, SourceModelFormatVersion)
  • ConfigMap JSON (ModelCustomProperties, VersionCustomProperties)

Credential Cloning: When retrying a job without providing new credentials, the BFF clones the credentials from the old job's secrets into new Secret objects. This ensures that deleting the old job (via the "Delete failed job" checkbox) doesn't affect the new job's secrets.
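
To illustrate why cloning rather than reusing matters, here is a minimal TypeScript sketch. The BFF itself is written in Go, and `SecretLike`, `cloneSecretForRetry`, and the `suffix` parameter are hypothetical names invented for this example, not the BFF's API. It only shows the idea that the new job gets its own copy of the credential data.

```typescript
// Hypothetical, simplified stand-in for a Kubernetes Secret (illustration only).
interface SecretLike {
  name: string;
  data: { [key: string]: string };
}

// Copy the old job's credential data into a brand-new secret object.
// `suffix` stands in for the random suffix Kubernetes appends when a
// resource is created with generateName; it is passed in here so the
// sketch stays deterministic.
function cloneSecretForRetry(
  oldSecret: SecretLike,
  newJobName: string,
  suffix: string,
): SecretLike {
  const copied: { [key: string]: string } = {};
  for (const key of Object.keys(oldSecret.data)) {
    copied[key] = oldSecret.data[key];
  }
  return { name: `${newJobName}-${suffix}`, data: copied };
}
```

Because the clone holds its own `data` object, deleting or mutating the old secret afterwards does not change what the retry job reads.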

Compatibility: Updated to use snake_case ConfigMap field names (model_format_name, model_format_version) per PR #2281.
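
The recovery behavior described above can be sketched roughly as follows, in TypeScript for readability even though the BFF is Go. Only the `model_format_name` and `model_format_version` ConfigMap keys come from this PR; the annotation keys, the `model_custom_properties` key, and every type and function name are placeholders assumed for the example.

```typescript
// Hypothetical shapes for illustration; the real BFF types differ.
interface RetryMetadata {
  author: string | undefined;
  description: string | undefined;
  sourceModelFormat: string | undefined;
  sourceModelFormatVersion: string | undefined;
  modelCustomProperties: { [key: string]: unknown };
}

interface OldJobState {
  annotations: { [key: string]: string };
  configMapData: { [key: string]: string };
}

// Prefer a value supplied in the PATCH payload; otherwise use the old job's.
function firstNonEmpty(
  fromPatch: string | undefined,
  fromOldJob: string | undefined,
): string | undefined {
  return fromPatch !== undefined && fromPatch !== '' ? fromPatch : fromOldJob;
}

// Parse a JSON object stored as a ConfigMap string, tolerating bad input.
function parseJsonObject(raw: string | undefined): { [key: string]: unknown } {
  if (raw === undefined || raw === '') {
    return {};
  }
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      return parsed as { [key: string]: unknown };
    }
    return {};
  } catch {
    return {};
  }
}

function recoverRetryMetadata(patch: Partial<RetryMetadata>, old: OldJobState): RetryMetadata {
  return {
    // Annotation keys are invented placeholders.
    author: firstNonEmpty(patch.author, old.annotations['example.com/author']),
    description: firstNonEmpty(patch.description, old.annotations['example.com/description']),
    // snake_case ConfigMap keys as named in the PR description.
    sourceModelFormat: firstNonEmpty(patch.sourceModelFormat, old.configMapData['model_format_name']),
    sourceModelFormatVersion: firstNonEmpty(
      patch.sourceModelFormatVersion,
      old.configMapData['model_format_version'],
    ),
    // The custom-properties key is also a placeholder.
    modelCustomProperties:
      patch.modelCustomProperties ?? parseJsonObject(old.configMapData['model_custom_properties']),
  };
}
```

With this shape, a PATCH that carries only a new name leaves every metadata field to be filled from the old job.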

Frontend Changes

Implements the UI for retrying failed transfer jobs:

  • RetryJobModal component with:

    • Auto-generated retry job name (adds -2, -3, etc. suffix)
    • Editable resource name with K8s DNS-1123 validation
    • Checkbox to delete the failed job after successful retry
    • Error handling and loading states
  • Retry button next to the "Failed" status label in the jobs table (kebab menu has only Delete action)

  • API integration with BFF PATCH endpoint supporting deleteOldJob query param
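
As a rough sketch of the suffix behavior listed above (append `-2`, or increment an existing numeric suffix), assuming a helper named `nextRetryJobName` that is hypothetical and not necessarily how RetryJobModal implements it:

```typescript
// Derive a retry job name from the failed job's name.
// "job" becomes "job-2"; "job-2" becomes "job-3".
function nextRetryJobName(previousName: string): string {
  const dash = previousName.lastIndexOf('-');
  const tail = dash >= 0 ? previousName.slice(dash + 1) : '';
  if (/^\d+$/.test(tail)) {
    // The name already ends in "-<number>": bump the number.
    return `${previousName.slice(0, dash)}-${parseInt(tail, 10) + 1}`;
  }
  // No numeric suffix yet: this is the first retry.
  return `${previousName}-2`;
}
```

This sketch treats any trailing `-<digits>` as a retry counter, so a name such as `model-2024` would become `model-2025`, and it does not enforce Kubernetes name length limits; a real implementation may need to handle both.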

Screenshots

The retry button appears next to the Failed status label. Clicking it opens the retry modal where users can:

  1. See the auto-generated new job name
  2. Optionally edit the resource name
  3. Choose whether to delete the failed job
  4. Confirm the retry operation
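
Confirming the modal results in a PATCH call to the BFF with the `deleteOldJob` query parameter. A minimal sketch of building that request is shown here; the route prefix, the `{ name }` body shape, and the `buildRetryRequest` helper are assumptions for illustration, since the actual endpoint path and payload are not shown in this description.

```typescript
interface RetryRequest {
  method: 'PATCH';
  url: string;
  body: string;
}

// Build the request for retrying a failed job.
// `basePath` is a hypothetical route prefix; only the deleteOldJob
// query parameter name comes from the PR description.
function buildRetryRequest(
  basePath: string,
  oldJobName: string,
  newJobName: string,
  deleteOldJob: boolean,
): RetryRequest {
  const flag = deleteOldJob ? 'true' : 'false';
  const url = `${basePath}/${encodeURIComponent(oldJobName)}?deleteOldJob=${flag}`;
  // Assumed payload: only the new name is sent, and the BFF recovers the rest.
  return { method: 'PATCH', url, body: JSON.stringify({ name: newJobName }) };
}
```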

How Has This Been Tested?

BFF Tests:

  • Added new test context TestModelTransferJob retry metadata preservation with tests:
    • Verifies PATCH with only new name preserves source, destination, and metadata from old job
    • Verifies PATCH clones credentials from old job into new secrets (creating new Secret objects, not reusing existing ones)
  • All existing BFF tests pass: make test in clients/ui/bff/

Frontend Tests:

  • Added 17 unit tests for RetryJobModal component covering:
    • Modal rendering and content
    • Auto-generated job name with suffix logic
    • Form validation for K8s names
    • Retry button interaction
    • Error handling and loading states
  • TypeScript compilation passes
  • ESLint passes
  • All unit tests pass: npm run test:unit (508 tests)
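
For context on the K8s name validation cases, a DNS-1123 label check can be sketched as below. This is a standalone illustration with a hypothetical `isValidDns1123Label` helper; it assumes the label form of DNS-1123 (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters) and is not the validation code used by the component.

```typescript
const DNS1123_LABEL_MAX_LENGTH = 63;
const DNS1123_LABEL_PATTERN = /^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/;

// Return true when the value is a valid DNS-1123 label.
function isValidDns1123Label(value: string): boolean {
  return (
    value.length > 0 &&
    value.length <= DNS1123_LABEL_MAX_LENGTH &&
    DNS1123_LABEL_PATTERN.test(value)
  );
}
```

Auto-generated names such as `my-job-2` pass this check, while names with uppercase letters or a leading or trailing `-` do not.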

Merge criteria:

  • The commits have meaningful messages
  • Automated tests are provided as part of the PR
  • The developer has manually tested the changes and verified that the changes work.
  • Code changes follow the kubeflow contribution guidelines.

If you have UI changes

  • The developer has added tests or explained why testing cannot be added.
  • Included any necessary screenshots or gifs if it was a UI change.
  • Verify that UI/UX changes conform to the UX guidelines for Kubeflow.

🤖 Generated with Claude Code

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from mturley. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mturley changed the title from "[WIP] Fix metadata preservation when retrying model transfer jobs" to "Add retry functionality for failed model transfer jobs" on Feb 24, 2026
mturley changed the title from "Add retry functionality for failed model transfer jobs" to "[WIP] Add retry functionality for failed model transfer jobs" on Feb 25, 2026
mturley force-pushed the RHOAIENG-27992-retry-job branch 3 times, most recently from a85d364 to b5d5bdc, on February 26, 2026 20:03
mturley changed the title from "[WIP] Add retry functionality for failed model transfer jobs" to "Add retry functionality for failed model transfer jobs" on Feb 26, 2026
@ppadti
Contributor

ppadti commented Feb 27, 2026

tested locally - when I clicked on edit resource name, the resource name field doesn't appear.

Screen.Recording.2026-02-27.at.3.47.47.PM.mov

<Stack hasGutter>
<StackItem>
A new transfer job will be created for the{' '}
<strong>{job.modelVersionName}</strong> model version.
Contributor

@ppadti ppadti Feb 27, 2026

I think based on intent we should be changing this <strong>{job.modelVersionName}</strong> to model name/version name?

Contributor Author

Good call, fixing.
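
One way the intent-based wording could be derived is sketched below. The intent values (CREATE_MODEL, CREATE_VERSION, UPDATE_ARTIFACT) appear in a later commit message on this PR; the `intent` and `registeredModelName` field names, the `retryTargetDescription` helper, and the exact sentences are assumptions for illustration, and only `modelVersionName` comes from the snippet under review.

```typescript
type TransferIntent = 'CREATE_MODEL' | 'CREATE_VERSION' | 'UPDATE_ARTIFACT';

// Hypothetical subset of the job fields needed for the description.
interface RetryDescriptionInput {
  intent: TransferIntent;
  registeredModelName: string;
  modelVersionName: string;
}

// CREATE_MODEL shows the model name; the other intents show the
// version name together with the model name.
function retryTargetDescription(job: RetryDescriptionInput): string {
  if (job.intent === 'CREATE_MODEL') {
    return `A new transfer job will be created for the ${job.registeredModelName} model.`;
  }
  return (
    `A new transfer job will be created for the ${job.modelVersionName} ` +
    `version of the ${job.registeredModelName} model.`
  );
}
```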

ModalFooter,
} from '@patternfly/react-core';
import { ModelTransferJob } from '~/app/types';
import ResourceNameDefinitionTooltip from '~/concepts/k8s/ResourceNameDefinitionTootip';
Contributor

typo in the file name: /ResourceNameDefinitionTootip -> /ResourceNameDefinitionTooltip

mturley and others added 8 commits February 27, 2026 14:39
The UpdateModelTransferJob function was not properly recovering metadata
fields from the original job when creating a retry. This caused the retry
job to lose important information like Author, Description, model format,
and custom properties.

Added recovery of:
- Author and Description from job annotations
- VersionDescription, SourceModelFormat, SourceModelFormatVersion from ConfigMap
- ModelCustomProperties and VersionCustomProperties from ConfigMap JSON

Also added comprehensive tests to verify retry metadata preservation.

Fixes part of RHOAIENG-38267

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

Implements the frontend UI for retrying failed transfer jobs:

- Add RetryJobModal component with:
  - Auto-generated retry job name (adds -2, -3, etc. suffix)
  - Editable resource name with K8s validation
  - Checkbox to delete the failed job after retry
  - Error handling and loading states

- Add Retry button next to Failed status label in the jobs table
- Wire up retry API call to BFF PATCH endpoint with deleteOldJob param
- Update API types to support additional query parameters

Part of RHOAIENG-38267

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

Tests cover:
- Modal rendering with correct title and description
- Auto-generated retry job name with -2 suffix
- Incrementing existing numeric suffixes
- Delete checkbox default state
- Form validation for K8s resource names
- Retry button click handling
- Error display on failure
- Loading state during retry
- Modal close on success

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

Update mockJob to include source, destination, and lastUpdateTimeSinceEpoch
fields that are now required by the ModelTransferJob type.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

- Change modelFormatName to model_format_name
- Change modelFormatVersion to model_format_version
- Update test to use HavePrefix for secret names since GenerateName
  now adds random suffixes

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

Renamed test from "preserves credentials by reusing secrets" to
"clones credentials from old job into new secrets" to make it clear
that we create NEW Secret objects (not reuse existing ones). This
ensures that deleting the old job doesn't affect the new job's secrets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

Add "model version" after the version name in the modal description
to match the design: "A new transfer job will be created for the
<name> model version."

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

- Fix typo: ResourceNameDefinitionTootip → ResourceNameDefinitionTooltip
- Update modal description to be intent-based:
  - CREATE_MODEL: shows model name
  - CREATE_VERSION/UPDATE_ARTIFACT: shows version name and model name
- Add test for CREATE_VERSION intent description

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

mturley force-pushed the RHOAIENG-27992-retry-job branch from 669d816 to d42db7d on February 27, 2026 19:50
@mturley
Contributor Author

mturley commented Feb 27, 2026

@ppadti

> tested locally - when I clicked on edit resource name, the resource name field doesn't appear.

oops. at one point it was using the existing K8sNameDescriptionField component instead of trying to write its own, and at some AI iteration I failed to re-review it. AI growing pains, sorry. Fixing it

Replace custom form field implementation with the existing
K8sNameDescriptionField component to avoid code duplication.
This provides consistent UX for k8s resource name handling
including auto-generation and validation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Mike Turley <mike.turley@alum.cs.umass.edu>

2 participants