add eval for diagnosing concurrency lease renewal failures by zzstoatzz · Pull Request #98 · PrefectHQ/prefect-mcp-server

zzstoatzz · 2025-10-20T18:50:13Z

closes #97

summary

adds a new eval (test_lease_renewal_crash) for issue #97 - tests that an agent can diagnose flow runs that crashed due to concurrency lease renewal failures.

based on real user issues:

Concurrency lease renewal failed - slots are no longer reserved. Terminating execution to prevent over-allocation. prefect#19068
Infrastructure stays running after Flow is marked Crashed by failed Lease renewal prefect#18839

changes

new: evals/test_lease_renewal_crash.py - eval for concurrency lease renewal crash diagnosis
updated: README with new eval

test plan

test_lease_renewal_crash passes locally
CI passes

🤖 Generated with Claude Code

github-actions · 2025-10-20T18:50:40Z

📊 Observability

View eval run traces in Logfire: prefect-mcp-server-evals @ f099915

github-actions · 2025-10-20T19:04:16Z

Evaluation Results

18 tests +1 17 ✅ +1 4m 40s ⏱️ +47s
1 suites ±0 0 💤 ±0
1 files ±0 1 ❌ ±0

For more details on these failures, see this check.

Results for commit f099915. ± Comparison against base commit 8833c0a.

♻️ This comment has been updated with latest results.

desertaxle · 2025-12-31T16:48:22Z

This feels a little contrived to me. The set up makes this an eval that checks if an agent can correctly retrieve and read a state message, which I think is less interesting than evaluating it against the real scenario. IMO, making this look more like our concurrency lease integration tests would make this eval more interesting.

I still think this eval should be closer to simulating a realistic situation. Actually crashing a run due to a failed lease renewal will help ensure that agents with the MCP server can debug the failure even with newer versions of Prefect.

@desertaxle

closes #97 rewrites the lease renewal crash eval per @desertaxle's review feedback. the previous draft monkey-patched the entire `_lease_renewal_loop` to raise immediately, which was contrived. this version only mocks the HTTP boundary (SyncPrefectClient.renew_concurrency_lease) so prefect's real renewal loop, retry policy, error handling, and cancel scope all execute end-to-end — mirroring the integration tests in prefect/integration-tests/test_concurrency_leases.py. the resulting CRASHED state and "Crash detected!" run log are exactly what an agent diagnosing a production incident would see. note: the "Concurrency lease renewal failed" message is currently emitted via the `prefect.concurrency` python logger (renewal thread can't reach get_run_logger()), so it does not land in the run logs API. the eval prompt accommodates this — the agent is expected to investigate the crash and surface concurrency-related causes, not extract a verbatim error message that isn't there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

the conftest's prefect_test_harness is session-scoped, so prior tests' flow runs sit in the same db. without a filter, read_flow_runs() returns the most recent run across all tests — which, when another test runs a scheduler-created deployment run right before this fixture, is the wrong one (we'd see SCHEDULED instead of our CRASHED run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

zzstoatzz marked this pull request as draft October 29, 2025 06:15

zzstoatzz changed the title ~~Add eval for debugging concurrency lease renewal failures~~ Add eval for debugging automation action validation failures Oct 29, 2025

zzstoatzz marked this pull request as ready for review October 29, 2025 18:11

zzstoatzz requested a review from desertaxle November 3, 2025 04:30

zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 17c3933 to 676acd5 Compare November 4, 2025 00:43

zzstoatzz marked this pull request as draft December 31, 2025 03:37

zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 676acd5 to ba0d8a8 Compare December 31, 2025 03:38

zzstoatzz changed the title ~~Add eval for debugging automation action validation failures~~ add eval for debugging concurrency lease renewal failures Dec 31, 2025

zzstoatzz mentioned this pull request Dec 31, 2025

fix flaky test_create_proactive_automation eval #118

Closed

zzstoatzz changed the title ~~add eval for debugging concurrency lease renewal failures~~ add eval for diagnosing concurrency lease renewal failures Dec 31, 2025

zzstoatzz marked this pull request as ready for review December 31, 2025 16:35

desertaxle reviewed Dec 31, 2025

View reviewed changes

zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch 4 times, most recently from d1f1f44 to 8bac451 Compare December 31, 2025 21:21

zzstoatzz requested a review from desertaxle December 31, 2025 21:28

zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 611decf to 446e6a0 Compare May 21, 2026 15:50

zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 446e6a0 to c6d7bad Compare May 21, 2026 16:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add eval for diagnosing concurrency lease renewal failures#98

add eval for diagnosing concurrency lease renewal failures#98
zzstoatzz wants to merge 2 commits into
mainfrom
claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78

zzstoatzz commented Oct 20, 2025 •

edited

Loading

Uh oh!

github-actions Bot commented Oct 20, 2025 •

edited

Loading

Uh oh!

github-actions Bot commented Oct 20, 2025 •

edited

Loading

Uh oh!

Uh oh!

desertaxle Dec 31, 2025

Uh oh!

desertaxle Dec 31, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zzstoatzz commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

summary

changes

test plan

Uh oh!

github-actions Bot commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Observability

Uh oh!

github-actions Bot commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Evaluation Results

Uh oh!

Uh oh!

desertaxle Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

desertaxle Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zzstoatzz commented Oct 20, 2025 •

edited

Loading

github-actions Bot commented Oct 20, 2025 •

edited

Loading

github-actions Bot commented Oct 20, 2025 •

edited

Loading