Skip to content

add eval for diagnosing concurrency lease renewal failures#98

Open
zzstoatzz wants to merge 2 commits into
mainfrom
claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78
Open

add eval for diagnosing concurrency lease renewal failures#98
zzstoatzz wants to merge 2 commits into
mainfrom
claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78

Conversation

@zzstoatzz

@zzstoatzz zzstoatzz commented Oct 20, 2025

Copy link
Copy Markdown
Collaborator

closes #97

summary

adds a new eval (test_lease_renewal_crash) for issue #97 - tests that an agent can diagnose flow runs that crashed due to concurrency lease renewal failures.

based on real user issues:

changes

  • new: evals/test_lease_renewal_crash.py - eval for concurrency lease renewal crash diagnosis
  • updated: README with new eval

test plan

  • test_lease_renewal_crash passes locally
  • CI passes

🤖 Generated with Claude Code

@github-actions

github-actions Bot commented Oct 20, 2025

Copy link
Copy Markdown

📊 Observability

View eval run traces in Logfire: prefect-mcp-server-evals @ f099915

@github-actions

github-actions Bot commented Oct 20, 2025

Copy link
Copy Markdown

Evaluation Results

18 tests  +1   17 ✅ +1   4m 40s ⏱️ +47s
 1 suites ±0    0 💤 ±0 
 1 files   ±0    1 ❌ ±0 

For more details on these failures, see this check.

Results for commit f099915. ± Comparison against base commit 8833c0a.

♻️ This comment has been updated with latest results.

@zzstoatzz zzstoatzz marked this pull request as draft October 29, 2025 06:15
@zzstoatzz zzstoatzz changed the title Add eval for debugging concurrency lease renewal failures Add eval for debugging automation action validation failures Oct 29, 2025
@zzstoatzz zzstoatzz marked this pull request as ready for review October 29, 2025 18:11
@zzstoatzz zzstoatzz requested a review from desertaxle November 3, 2025 04:30
@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 17c3933 to 676acd5 Compare November 4, 2025 00:43
@zzstoatzz zzstoatzz marked this pull request as draft December 31, 2025 03:37
@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 676acd5 to ba0d8a8 Compare December 31, 2025 03:38
@zzstoatzz zzstoatzz changed the title Add eval for debugging automation action validation failures add eval for debugging concurrency lease renewal failures Dec 31, 2025
@zzstoatzz zzstoatzz changed the title add eval for debugging concurrency lease renewal failures add eval for diagnosing concurrency lease renewal failures Dec 31, 2025
@zzstoatzz zzstoatzz marked this pull request as ready for review December 31, 2025 16:35
Comment thread evals/test_create_proactive_automation.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little contrived to me. The set up makes this an eval that checks if an agent can correctly retrieve and read a state message, which I think is less interesting than evaluating it against the real scenario. IMO, making this look more like our concurrency lease integration tests would make this eval more interesting.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this eval should be closer to simulating a realistic situation. Actually crashing a run due to a failed lease renewal will help ensure that agents with the MCP server can debug the failure even with newer versions of Prefect.

@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch 4 times, most recently from d1f1f44 to 8bac451 Compare December 31, 2025 21:21
@zzstoatzz zzstoatzz requested a review from desertaxle December 31, 2025 21:28
@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 611decf to 446e6a0 Compare May 21, 2026 15:50
closes #97

rewrites the lease renewal crash eval per @desertaxle's review feedback.
the previous draft monkey-patched the entire `_lease_renewal_loop` to
raise immediately, which was contrived.

this version only mocks the HTTP boundary (SyncPrefectClient.renew_concurrency_lease)
so prefect's real renewal loop, retry policy, error handling, and cancel
scope all execute end-to-end — mirroring the integration tests in
prefect/integration-tests/test_concurrency_leases.py.

the resulting CRASHED state and "Crash detected!" run log are exactly
what an agent diagnosing a production incident would see.

note: the "Concurrency lease renewal failed" message is currently emitted
via the `prefect.concurrency` python logger (renewal thread can't reach
get_run_logger()), so it does not land in the run logs API. the eval
prompt accommodates this — the agent is expected to investigate the
crash and surface concurrency-related causes, not extract a verbatim
error message that isn't there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zzstoatzz zzstoatzz force-pushed the claude/fix-issue-97-011CUJwm9Jqwg6cRwWdonV78 branch from 446e6a0 to c6d7bad Compare May 21, 2026 16:24
the conftest's prefect_test_harness is session-scoped, so prior tests'
flow runs sit in the same db. without a filter, read_flow_runs() returns
the most recent run across all tests — which, when another test runs a
scheduler-created deployment run right before this fixture, is the wrong
one (we'd see SCHEDULED instead of our CRASHED run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

add eval for debugging concurrency lease renewal failures

2 participants