Skip to content

feat(backend): add retry attempt metadata to MLMD executions#12904

Open
jeffspahr wants to merge 8 commits intokubeflow:masterfrom
jeffspahr:codex/resolve-12872
Open

feat(backend): add retry attempt metadata to MLMD executions#12904
jeffspahr wants to merge 8 commits intokubeflow:masterfrom
jeffspahr:codex/resolve-12872

Conversation

@jeffspahr
Copy link
Contributor

@jeffspahr jeffspahr commented Feb 25, 2026

Summary

  • Populates max_retry_count and retry_count as int custom properties on MLMD executions so the frontend can display retry visibility (companion to feat(#12169): Show retry attempts for pipeline tasks in UI #12596)
  • The driver writes max_retry_count at execution creation time from the task's retry policy; the launcher writes retry_count (current attempt) at pre-publish time via the KFP_RETRY_COUNT env var injected by the Argo compiler
  • Includes compiler tests verifying the env var is present/absent and that the package-level commonEnvs slice is not mutated
  • Also included (from branch history): frontend cleanup removing dead deps (request, coveralls, swagger-ts-client), the unused build:replace script, and the Coveralls badge

Test plan

  • go build ./... passes
  • go test ./compiler/argocompiler/... — 3 new tests pass (env var present, absent, no mutation)
  • go test ./driver/... ./component/... ./metadata/... — all existing tests pass
  • Frontend PR feat(#12169): Show retry attempts for pipeline tasks in UI #12596 should read retry_count / max_retry_count with getIntValue() (not getStringValue())

Remove three unused dependencies that collectively pull in the
deprecated `request` package and its large transitive tree.

## `request` (direct dependency)

Used only by `frontend/scripts/replace_protos.js`, a script that
downloads MLMD proto files from GitHub via `npm run build:replace`.

Evidence the script is dead:
- Only one file in the repo imports `request`: this script (line 19)
- The script was last modified in commit e035a88 (Aug 2021) to bump
  MLMD to v1.2.0
- The most recent MLMD upgrade (commit eaec515, Aug 2023, bumping to
  v1.14.0) updated the proto files directly in git without running
  this script
- `build:replace` is not referenced in any CI workflow, Makefile,
  or other script
- The proto files the script manages (third_party/ml-metadata/...)
  are checked directly into git and updated manually

## `coveralls` (devDependency)

Submits test coverage data to the Coveralls service.

Evidence it is dead:
- The two scripts that used it were deleted in commit c7f8729
  (Sep 2024, "chore: Remove unwanted Frontend test files kubeflow#10973"):
  `frontend/scripts/report-coveralls.sh` and
  `frontend/scripts/get-coveralls-repo-token.js`
- The npm script that invoked them (`test:ci:prow`) was also removed
  in that commit
- No current npm script, CI workflow, or source file references
  coveralls
- No .coveralls.yml config file exists

## `swagger-ts-client` (devDependency)

Generates TypeScript API clients from Swagger specs.

Evidence it is dead:
- Present since the initial commit 633e2dd but was never actually
  used — the repo has always used Java-based swagger-codegen-cli.jar
  for API client generation (see apis:* scripts in package.json)
- Zero imports, references, npm scripts, or config files for
  swagger-ts-client exist anywhere in the repo

## Changes

- Delete `frontend/scripts/replace_protos.js`
- Remove `request`, `coveralls`, `swagger-ts-client` from
  package.json
- Remove `build:replace` npm script
- Regenerate package-lock.json (64 transitive packages pruned)
- Remove `build:replace` documentation from CONTRIBUTING.md

## Verification

- npm ci: passes
- npm run lint: passes
- npm run typecheck: passes
- npm run test: 117/117 files, 1652/1652 tests pass
- npm ls request: confirms request is no longer in the tree

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
Address Copilot review feedback:
- Remove `npm run build:replace` reference from
  third_party/ml-metadata/README.md, point to build:protos instead
- Remove dead Coveralls badge from root README.md (service no longer
  used since kubeflow#10973)

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
The previous lockfile was generated with Node 25 / npm 11, which
produces a different resolution format. CI uses Node 22.19.0 / npm
10.9.3 (per .nvmrc), causing npm ci to fail with missing entries
(e.g. babel-plugin-macros@3.1.0).

Regenerated with the correct toolchain version.

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
Resolve frontend/package-lock.json conflict by regenerating the lockfile with Node 22.19.0 / npm 10.9.3 after applying upstream master changes.

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
Use the current master lockfile as baseline and regenerate lock metadata under Node 22.19.0/npm 10 so only this PR's dependency removals remain. This avoids unintended transitive upgrades that caused frontend TypeScript failures in CI.

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
Populate max_retry_count and retry_count as int custom properties on
MLMD executions so the frontend can display retry visibility (PR kubeflow#12596).

- metadata/client.go: add MaxRetryCount/RetryCount fields to
  ExecutionConfig; write them in CreateExecution and PrePublishExecution
- driver/container.go: read retry policy max_retry_count at execution
  creation time
- compiler/argocompiler/container.go: inject KFP_RETRY_COUNT={{retries}}
  env var into the impl template when retry is configured
- component/constants.go: add EnvRetryCount constant
- component/launcher_v2.go: read KFP_RETRY_COUNT env in prePublish and
  set retry_count on the execution config
- compiler/argocompiler/container_test.go: verify env var presence,
  absence, and that commonEnvs is not mutated

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
Copilot AI review requested due to automatic review settings February 25, 2026 07:34
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zazulam for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds retry-attempt visibility to ML Metadata (MLMD) executions for KFP v2 so the UI can display retry information, by persisting both the configured max retries and the current retry counter. Also includes some frontend/proto-tooling cleanup (removed deprecated scripts/deps and coverage badge).

Changes:

  • Persist max_retry_count on MLMD execution creation (from task retry policy) and retry_count during pre-publish (from KFP_RETRY_COUNT).
  • Argo compiler injects KFP_RETRY_COUNT={{retries}} when a task has a retry policy, and adds tests around env var presence and commonEnvs behavior.
  • Removes legacy frontend proto replacement script and prunes related frontend dependencies / Coveralls badge.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
backend/src/v2/metadata/client.go Adds max_retry_count / retry_count MLMD custom properties support via ExecutionConfig.
backend/src/v2/driver/container.go Populates MaxRetryCount from the task retry policy when creating executions.
backend/src/v2/component/launcher_v2.go Reads KFP_RETRY_COUNT and writes retry_count at pre-publish time.
backend/src/v2/component/constants.go Introduces EnvRetryCount constant.
backend/src/v2/compiler/argocompiler/container.go Injects KFP_RETRY_COUNT={{retries}} into container templates when retry is configured.
backend/src/v2/compiler/argocompiler/container_test.go Adds compiler tests for retry env var and commonEnvs non-mutation.
frontend/scripts/replace_protos.js Removes legacy proto replacement script.
frontend/package.json Removes build:replace and drops unused/deprecated deps (e.g., request, coveralls, swagger-ts-client).
frontend/package-lock.json Lockfile updates reflecting removed deps.
frontend/CONTRIBUTING.md Removes documentation for the deleted build:replace workflow.
third_party/ml-metadata/README.md Updates MLMD JS proto regeneration docs link/instructions.
README.md Removes Coveralls badge.
Files not reviewed (1)
  • frontend/package-lock.json: Language not supported

- Clarify RetryCount comment as 0-based counter (matches Argo {{retries}})
- Log warning at V(2) when KFP_RETRY_COUNT parse fails instead of
  silently dropping retry metadata
- Deep-compare commonEnvs contents in mutation test instead of just
  checking length

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
Add the KFP_RETRY_COUNT={{retries}} env var to the expected compiled
workflow for pipeline_with_retry, matching the new compiler behavior.

Signed-off-by: Jeff Spahr <spahrj@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants