add eval for diagnosing concurrency lease renewal failures#98
add eval for diagnosing concurrency lease renewal failures#98zzstoatzz wants to merge 2 commits into
Conversation
📊 ObservabilityView eval run traces in Logfire: prefect-mcp-server-evals @ f099915 |
Evaluation Results18 tests +1 17 ✅ +1 4m 40s ⏱️ +47s For more details on these failures, see this check. Results for commit f099915. ± Comparison against base commit 8833c0a. ♻️ This comment has been updated with latest results. |
17c3933 to
676acd5
Compare
676acd5 to
ba0d8a8
Compare
There was a problem hiding this comment.
This feels a little contrived to me. The set up makes this an eval that checks if an agent can correctly retrieve and read a state message, which I think is less interesting than evaluating it against the real scenario. IMO, making this look more like our concurrency lease integration tests would make this eval more interesting.
There was a problem hiding this comment.
I still think this eval should be closer to simulating a realistic situation. Actually crashing a run due to a failed lease renewal will help ensure that agents with the MCP server can debug the failure even with newer versions of Prefect.
d1f1f44 to
8bac451
Compare
611decf to
446e6a0
Compare
closes #97 rewrites the lease renewal crash eval per @desertaxle's review feedback. the previous draft monkey-patched the entire `_lease_renewal_loop` to raise immediately, which was contrived. this version only mocks the HTTP boundary (SyncPrefectClient.renew_concurrency_lease) so prefect's real renewal loop, retry policy, error handling, and cancel scope all execute end-to-end — mirroring the integration tests in prefect/integration-tests/test_concurrency_leases.py. the resulting CRASHED state and "Crash detected!" run log are exactly what an agent diagnosing a production incident would see. note: the "Concurrency lease renewal failed" message is currently emitted via the `prefect.concurrency` python logger (renewal thread can't reach get_run_logger()), so it does not land in the run logs API. the eval prompt accommodates this — the agent is expected to investigate the crash and surface concurrency-related causes, not extract a verbatim error message that isn't there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
446e6a0 to
c6d7bad
Compare
the conftest's prefect_test_harness is session-scoped, so prior tests' flow runs sit in the same db. without a filter, read_flow_runs() returns the most recent run across all tests — which, when another test runs a scheduler-created deployment run right before this fixture, is the wrong one (we'd see SCHEDULED instead of our CRASHED run). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
closes #97
summary
adds a new eval (
test_lease_renewal_crash) for issue #97 - tests that an agent can diagnose flow runs that crashed due to concurrency lease renewal failures.based on real user issues:
changes
evals/test_lease_renewal_crash.py- eval for concurrency lease renewal crash diagnosistest plan
test_lease_renewal_crashpasses locally🤖 Generated with Claude Code