Summary
This is the #1 user request. After running evaluations, users see scores like "Groundedness: 0.4, Context Relevance: 0.6" but don't know why scores are low or what to change. The existing skills stop at "Running Evaluations" — there's no skill that closes the loop from eval results to actionable improvements.
What
Create `skills/diagnosis/SKILL.md` that guides the user through:
Triage
Pull low-scoring records via `session.get_records_and_feedback()`, filter to failing metrics, and identify patterns (which feedback functions fail most? which queries? which app versions?)
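A minimal triage sketch, assuming a TruLens-style `TruSession` where `get_records_and_feedback()` returns a pandas DataFrame of records plus the list of feedback column names; the 0.5 threshold and the `app_id`/`input` column names are illustrative assumptions, not guaranteed by the API.

```python
from trulens.core import TruSession

session = TruSession()

# Assumed to return (records_df, feedback_column_names)
records, feedback_cols = session.get_records_and_feedback()

THRESHOLD = 0.5  # illustrative cutoff for a "failing" score

for metric in feedback_cols:
    failing = records[records[metric] < THRESHOLD]
    if failing.empty:
        continue
    print(f"{metric}: {len(failing)} failing records")
    # Which app versions fail most on this metric?
    print(failing.groupby("app_id")[metric].mean().sort_values().head())
    # Worst-scoring queries for manual inspection
    print(failing.nsmallest(3, metric)[["input", metric]])
```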
Root Cause Analysis
For each failing metric, inspect the OTEL trace to find the problematic span — e.g., low `groundedness` points to retrieval issues (bad chunks, wrong `k`), low `tool_selection` points to tool descriptions or routing logic
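Continuing from the triage sketch, a hedged sketch of drilling into one failing record's trace; it assumes the records frame carries the serialized trace in a `record_json` column and that each instrumented call records a method path and return value, which depends on how the app was instrumented.

```python
import json

metric = "groundedness"  # assumed feedback column name; pick the failing metric
worst = records.nsmallest(1, metric).iloc[0]
record = json.loads(worst["record_json"])

# Walk the recorded calls (spans): a retrieval span returning off-topic chunks,
# for example, would explain a low groundedness score.
for call in record.get("calls", []):
    frame = (call.get("stack") or [{}])[-1]
    print("span:", frame.get("path"), frame.get("method", {}).get("name"))
    print("  returned:", str(call.get("rets", ""))[:200])
```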
Actionable Recommendations
Based on the failure pattern, suggest concrete fixes (see the retrieval-config sketch after this list):
- Low `context_relevance` → adjust chunk size, overlap, embedding model, or retrieval `k`
- Low `groundedness` → add source-attribution instructions to the system prompt, filter irrelevant chunks
- Low `tool_selection` → improve tool descriptions, add few-shot examples to the agent prompt
- Low `plan_adherence` → simplify plan structure, add explicit step validation
- Low `coherence` → adjust temperature, add output-formatting instructions
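For the retrieval-related recommendations, a sketch of how those knobs might be captured as an explicit config so a baseline and a candidate fix can be built and compared; `RetrievalConfig` and its field names are hypothetical, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    # Hypothetical knobs mirroring the fixes listed above
    chunk_size: int = 1024
    chunk_overlap: int = 0
    top_k: int = 2
    embedding_model: str = "text-embedding-3-small"

baseline = RetrievalConfig()
# Candidate fix for low context_relevance: smaller chunks, some overlap, wider retrieval
candidate = RetrievalConfig(chunk_size=512, chunk_overlap=64, top_k=5)
```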
Re-evaluate
After making changes, re-run evals on the same dataset and compare versions using `session.get_leaderboard()` or the Compare tab
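A comparison sketch, assuming both the baseline and the fixed app were evaluated on the same dataset under different app ids (the ids below are placeholders) and that `get_leaderboard()` returns one row of mean feedback scores per app version.

```python
# Aggregate mean scores, restricted to the two versions under comparison
leaderboard = session.get_leaderboard(app_ids=["rag_v1", "rag_v2"])
print(leaderboard)
```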
Regression Check
Ensure fixes for one metric didn't degrade another
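A regression check on top of the leaderboard above, assuming it is indexed by app id with one numeric column per feedback function; any metric whose delta is negative regressed under the fix.

```python
# Positive delta = rag_v2 improved on that metric; negative = regression
delta = (leaderboard.loc["rag_v2"] - leaderboard.loc["rag_v1"]).sort_values()
regressions = delta[delta < 0]
if not regressions.empty:
    print("Metrics degraded by the fix:")
    print(regressions)
```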
Reference
Existing skills live in `skills/` as structured Markdown files (`SKILL.md`) following spec version 0.1.0.
Difficulty
Medium