[MCP Integration] Phase 6: Job management service#2239

Merged
aniruddh-alt merged 10 commits into ani/mcp-integration-05-docs-svc from ani/mcp-integration-06-job-svc
Mar 24, 2026
Conversation

@aniruddh-alt
Contributor

Description

Part of the MCP Integration PR chain (Phase 6 of 10) - Stage: service

What changed: Added job_service.py for job submission (local subprocess + cloud via oumi.launcher), status polling, cancellation, and async log streaming. Includes JobRegistry for tracking active jobs and registry tests.

Why: Job management is the core execution capability of the MCP server — it enables launching, monitoring, and controlling ML training jobs.
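A minimal sketch of the local-subprocess half of this description (submit, poll status, cancel), using only `asyncio` from the standard library. The `LocalJob` class, its method names, and the status strings are hypothetical and are not taken from `job_service.py`; the cloud path via `oumi.launcher` is not shown.

```python
from __future__ import annotations

import asyncio
import sys


class LocalJob:
    """Hypothetical wrapper around one local subprocess job."""

    def __init__(self, argv: list[str]) -> None:
        self.argv = argv
        self.proc: asyncio.subprocess.Process | None = None
        self.cancelled = False

    async def start(self) -> None:
        # Launch the command without blocking the event loop.
        self.proc = await asyncio.create_subprocess_exec(
            *self.argv,
            stdout=asyncio.subprocess.DEVNULL,
            stderr=asyncio.subprocess.DEVNULL,
        )

    def status(self) -> str:
        # Polling only inspects state; it never waits on the process.
        if self.proc is None:
            return "pending"
        if self.proc.returncode is None:
            return "running"
        if self.cancelled:
            return "cancelled"
        return "succeeded" if self.proc.returncode == 0 else "failed"

    async def cancel(self, grace_seconds: float = 5.0) -> None:
        if self.proc is None or self.proc.returncode is not None:
            return
        self.cancelled = True
        self.proc.terminate()  # polite request first
        try:
            await asyncio.wait_for(self.proc.wait(), timeout=grace_seconds)
        except asyncio.TimeoutError:
            self.proc.kill()  # force-kill if the grace period expires
            await self.proc.wait()  # reap so no zombie process is left
```

Calling `await job.start()` and then `job.status()` from a polling loop keeps the server responsive, and `cancel()` always waits on the process so it is reaped.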

Related issues

Before submitting

  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?


@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-05-docs-svc branch from 03c5392 to 0d68b18 Compare March 4, 2026 00:04
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-06-job-svc branch from 2a4d610 to 6602b6b Compare March 4, 2026 00:04
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-05-docs-svc branch 2 times, most recently from fae7907 to aaed3bb Compare March 4, 2026 00:44
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-06-job-svc branch from 6602b6b to 11b574c Compare March 4, 2026 00:45
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-05-docs-svc branch from aaed3bb to 9e7b6a2 Compare March 4, 2026 00:50
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-06-job-svc branch from 11b574c to 8a4fe87 Compare March 4, 2026 00:50
@aniruddh-alt aniruddh-alt marked this pull request as ready for review March 8, 2026 16:50
Comment on lines +502 to +505
    except Exception as exc:
        rt.error_message = str(exc)
        logger.exception("Failed to launch cloud job %s", record.job_id)
        return ""

This comment was marked as outdated.

Contributor

+1


        return removed

    def _save(self) -> None:
Contributor

Could we have the MCP server store this somewhere less temporary? I'm thinking about what would happen if the MCP server restarts: it'd be nice not to lose track of which jobs were submitted.
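One way to read this request, as a hedged sketch: keep the registry file under a per-user directory and reload it when the server starts. `PersistentRegistry`, `DEFAULT_PATH`, and the record layout below are hypothetical; the PR does not show where `JobRegistry` actually stores its data.

```python
from __future__ import annotations

import json
from pathlib import Path

# Hypothetical location that survives reboots, unlike a temp directory.
DEFAULT_PATH = Path.home() / ".example_mcp" / "jobs.json"


class PersistentRegistry:
    """Hypothetical registry that reloads previously submitted jobs."""

    def __init__(self, path: Path = DEFAULT_PATH) -> None:
        self.path = path
        self.jobs: dict[str, dict] = {}
        if path.exists():
            # Recover jobs recorded by an earlier server process.
            self.jobs = json.loads(path.read_text(encoding="utf-8"))

    def add(self, job_id: str, record: dict) -> None:
        self.jobs[job_id] = record
        self._save()

    def _save(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self.jobs, indent=2), encoding="utf-8")
```

A second `PersistentRegistry` built on the same path sees every job the first one recorded, which is the restart behavior the comment asks for.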


"""Job management service for Oumi MCP execution tools.

Provides job submission, status polling, cancellation, and log streaming
Contributor

This file is getting a bit overwhelming in implementation details - could we move job handling and logs to separate .py files and have this service take them in as dependencies in its constructor? That will clean up a lot of the excess private methods and logic being brought in here when ideally we'd want to keep this file clean since it's acting as the interface.
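A rough sketch of the structure suggested here, assuming job handling and log reading each live in their own module and are passed to the service's constructor. All class and method names (`JobRunner`, `LogReader`, `JobService`, `submit`, `cancel`, `tail`) are hypothetical, and the in-memory implementations exist only to make the example runnable.

```python
from __future__ import annotations

from typing import Protocol


class JobRunner(Protocol):
    """Hypothetical interface for whatever launches and cancels jobs."""

    def submit(self, config_path: str) -> str: ...
    def cancel(self, job_id: str) -> bool: ...


class LogReader(Protocol):
    """Hypothetical interface for whatever reads job logs."""

    def tail(self, job_id: str, lines: int) -> list[str]: ...


class JobService:
    """Thin interface layer; implementation details live in its dependencies."""

    def __init__(self, runner: JobRunner, logs: LogReader) -> None:
        self._runner = runner
        self._logs = logs

    def submit(self, config_path: str) -> str:
        return self._runner.submit(config_path)

    def cancel(self, job_id: str) -> bool:
        return self._runner.cancel(job_id)

    def tail_logs(self, job_id: str, lines: int = 50) -> list[str]:
        return self._logs.tail(job_id, lines)


class InMemoryRunner:
    """Stand-in runner used only for this example."""

    def __init__(self) -> None:
        self.active: set[str] = set()

    def submit(self, config_path: str) -> str:
        job_id = f"job-{len(self.active) + 1}"
        self.active.add(job_id)
        return job_id

    def cancel(self, job_id: str) -> bool:
        if job_id in self.active:
            self.active.remove(job_id)
            return True
        return False


class InMemoryLogs:
    """Stand-in log reader used only for this example."""

    def tail(self, job_id: str, lines: int) -> list[str]:
        return [f"{job_id}: line {i}" for i in range(lines)]
```

With this split, the service file holds only the public methods, and tests can substitute fakes such as `InMemoryRunner` without starting real processes.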

@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-05-docs-svc branch from d84edd0 to 5a9cd18 Compare March 24, 2026 20:35
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-06-job-svc branch from fa46a23 to b97b16f Compare March 24, 2026 20:49
Comment on lines +518 to +526
    except TimeoutError:
        return {
            "success": False,
            "error": (
                f"Cancel timed out after 30s "
                f"(cloud={cloud}, cluster={cluster_name}, id={job_id}). "
                "The cancellation may still be in progress. "
                "Check cloud console or retry."
            ),

This comment was marked as outdated.
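The excerpt above returns a structured error when a cancel call exceeds its deadline. A minimal sketch of that pattern with `asyncio.wait_for` follows; `cancel_with_timeout` and its `cancel_fn` parameter are hypothetical stand-ins for the real cloud cancellation call. It catches `asyncio.TimeoutError`, which is an alias of the built-in `TimeoutError` from Python 3.11 onward and a distinct class before that.

```python
import asyncio
from typing import Awaitable, Callable


async def cancel_with_timeout(
    cancel_fn: Callable[[], Awaitable[None]],
    timeout: float = 30.0,
) -> dict:
    """Run a (hypothetical) cancel coroutine under a deadline."""
    try:
        await asyncio.wait_for(cancel_fn(), timeout=timeout)
    except asyncio.TimeoutError:
        # The remote side may still finish cancelling; report, don't raise.
        return {
            "success": False,
            "error": (
                f"Cancel timed out after {timeout:g}s. "
                "The cancellation may still be in progress."
            ),
        }
    return {"success": True}
```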

Comment on lines +203 to +204
    except asyncio.TimeoutError:
        raw = "".join(chunks)

This comment was marked as outdated.
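The two lines above keep whatever log chunks arrived before a timeout. A hedged sketch of that idea, combined with the cap on the `lines` parameter mentioned in the commit notes: `read_with_deadline` and `MAX_LINES` are hypothetical names, and the stream is modeled as any async iterator of strings.

```python
import asyncio
from typing import AsyncIterator, List

MAX_LINES = 1000  # hypothetical upper bound on requested lines


async def read_with_deadline(
    stream: AsyncIterator[str], timeout: float, lines: int
) -> List[str]:
    """Collect chunks until the stream ends or the deadline passes."""
    lines = max(1, min(lines, MAX_LINES))  # clamp unbounded requests
    chunks: List[str] = []

    async def _drain() -> None:
        async for chunk in stream:
            chunks.append(chunk)

    try:
        await asyncio.wait_for(_drain(), timeout=timeout)
    except asyncio.TimeoutError:
        pass  # keep the partial output gathered so far
    raw = "".join(chunks)
    return raw.splitlines()[-lines:]
```

Because `chunks` lives outside the drained coroutine, a timeout discards nothing that was already read.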

Aniruddhan Ramesh and others added 9 commits March 24, 2026 14:40
Add job_service.py for job submission (local subprocess + cloud via
oumi.launcher), status polling, cancellation, and log streaming.
Includes JobRegistry for tracking active jobs and registry tests.

Part of the MCP integration PR chain (Phase 6 of 10).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use asyncio.Lock for _runtimes dict concurrency protection
- Ensure subprocesses are killed and reaped on cancellation
- Use atomic write pattern in JobRegistry._save()
- Cap unbounded lines parameter in log streaming
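The atomic write pattern named in the list above is commonly implemented by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that general technique with the standard library; `atomic_write_json` is a hypothetical helper and is not claimed to match `JobRegistry._save()`.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data: dict) -> None:
    """Write JSON so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # atomically swap in the new file
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If the process dies mid-write, the previous registry file is left untouched and only a stray `.tmp` file may remain.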
Pydantic requires typing_extensions.TypedDict (not typing.TypedDict)
on Python < 3.12 when NotRequired fields are present.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-05-docs-svc branch from 5a9cd18 to eb40707 Compare March 24, 2026 21:41
@aniruddh-alt aniruddh-alt force-pushed the ani/mcp-integration-06-job-svc branch from b97b16f to ad58177 Compare March 24, 2026 21:41
Co-authored-by: Aniruddhan Ramesh <aniruddhanramesh@Aniruddhans-MacBook-Pro.local>
@aniruddh-alt aniruddh-alt merged commit 0a8f7ea into ani/mcp-integration-05-docs-svc Mar 24, 2026
@aniruddh-alt aniruddh-alt deleted the ani/mcp-integration-06-job-svc branch March 24, 2026 23:00
2 participants