Add Diagnosis and Improvement Agent Skill #2422

@joshreini1

Summary

This is the #1 user request. After running evaluations, users see scores like "Groundedness: 0.4, Context Relevance: 0.6" but don't know why the scores are low or what to change. The existing skills stop at "Running Evaluations"; no skill closes the loop from eval results to actionable improvements.

What

Create skills/diagnosis/SKILL.md that guides the user through:

Triage

Pull low-scoring records via session.get_records_and_feedback(), filter to failing metrics, and identify patterns (which feedback functions fail most? which queries? which app versions?)
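A minimal sketch of the triage step, not a definitive implementation. It assumes the records DataFrame from session.get_records_and_feedback() has already been converted to plain dicts (for example with pandas' `to_dict("records")`), and that each row carries `input` and `app_version` keys plus one numeric column per feedback function. The 0.5 threshold and the sample rows are illustrative assumptions.

```python
from collections import Counter

THRESHOLD = 0.5  # assumed pass/fail cutoff; tune per metric


def triage(records, metrics, threshold=THRESHOLD):
    """Return (failing_records, failures_per_metric, failures_per_version)."""
    failing = []
    per_metric = Counter()
    per_version = Counter()
    for rec in records:
        # A missing score (None) or NaN never compares as "< threshold",
        # so unevaluated records are skipped rather than counted as failures.
        failed = [m for m in metrics if rec.get(m) is not None and rec[m] < threshold]
        if failed:
            failing.append({**rec, "failed_metrics": failed})
            per_metric.update(failed)
            per_version[rec.get("app_version")] += 1
    return failing, per_metric, per_version


# Illustrative rows only; real rows would come from the session.
records = [
    {"input": "q1", "app_version": "v1", "Groundedness": 0.4, "Context Relevance": 0.6},
    {"input": "q2", "app_version": "v1", "Groundedness": 0.9, "Context Relevance": 0.3},
    {"input": "q3", "app_version": "v2", "Groundedness": 0.2, "Context Relevance": 0.2},
]
failing, per_metric, per_version = triage(records, ["Groundedness", "Context Relevance"])
for metric, count in per_metric.most_common():
    print(f"{metric}: {count} failing record(s)")
```

Sorting by `per_metric.most_common()` answers "which feedback functions fail most?", and `per_version` does the same for app versions.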

Root Cause Analysis

For each failing metric, inspect the OTEL trace to find the problematic span. For example, low groundedness points to retrieval issues (bad chunks, wrong k), and low tool_selection points to tool descriptions or routing logic.
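One way to sketch the metric-to-span lookup described above. The span dictionaries, the `span_type` key, and the span type names (`retrieval`, `tool`) are hypothetical stand-ins for whatever the exported OTEL trace actually contains; only the two metric-to-cause pairings come from this issue.

```python
# Hypothetical mapping from a failing metric to the span type worth inspecting first.
SUSPECT_SPAN_TYPE = {
    "groundedness": "retrieval",  # bad chunks, wrong k
    "tool_selection": "tool",     # tool descriptions or routing logic
}


def suspect_spans(metric, spans):
    """Return the spans in one trace that most plausibly explain a low score."""
    wanted = SUSPECT_SPAN_TYPE.get(metric)
    if wanted is None:
        return []  # no heuristic for this metric; inspect the full trace
    return [span for span in spans if span.get("span_type") == wanted]


# Illustrative trace; the attribute names are assumptions, not a real schema.
trace = [
    {"name": "retrieve", "span_type": "retrieval", "attributes": {"k": 2}},
    {"name": "generate", "span_type": "generation", "attributes": {}},
]
hits = suspect_spans("groundedness", trace)
print([span["name"] for span in hits])
```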

Actionable Recommendations

Based on the failure pattern, suggest concrete fixes:

  • Low context_relevance → adjust chunk size, overlap, embedding model, or retrieval k
  • Low groundedness → add source attribution instructions to system prompt, filter irrelevant chunks
  • Low tool_selection → improve tool descriptions, add few-shot examples to agent prompt
  • Low plan_adherence → simplify plan structure, add explicit step validation
  • Low coherence → adjust temperature, add output formatting instructions
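The list above can be encoded as a small lookup table that a skill walks through after triage. This is a sketch; the lowercase metric keys and the fallback message are assumptions, while the suggestions themselves are copied from the list.

```python
RECOMMENDATIONS = {
    "context_relevance": [
        "adjust chunk size, overlap, embedding model, or retrieval k",
    ],
    "groundedness": [
        "add source attribution instructions to system prompt",
        "filter irrelevant chunks",
    ],
    "tool_selection": [
        "improve tool descriptions",
        "add few-shot examples to agent prompt",
    ],
    "plan_adherence": [
        "simplify plan structure",
        "add explicit step validation",
    ],
    "coherence": [
        "adjust temperature",
        "add output formatting instructions",
    ],
}

# Assumed fallback for metrics the table does not cover.
FALLBACK = ["no playbook entry; inspect traces manually"]


def recommend(failing_metrics):
    """Map each failing metric to its suggested fixes."""
    return {m: RECOMMENDATIONS.get(m, FALLBACK) for m in failing_metrics}


plan = recommend(["groundedness", "latency"])
for metric, fixes in plan.items():
    print(metric, "->", "; ".join(fixes))
```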

Re-evaluate

After making changes, re-run evals on the same dataset and compare app versions using session.get_leaderboard() or the Compare tab.
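As a rough illustration of what a version comparison computes, the sketch below averages each metric per app version from plain record dicts. It is a stand-in for the aggregated view, not the leaderboard implementation; the `app_version` key and the sample scores are assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_scores_by_version(records, metrics):
    """Return {version: {metric: mean score}} over records that have a score."""
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for metric in metrics:
            score = rec.get(metric)
            if score is not None:
                buckets[rec["app_version"]][metric].append(score)
    return {
        version: {metric: mean(scores) for metric, scores in by_metric.items()}
        for version, by_metric in buckets.items()
    }


# Same two queries evaluated before (v1) and after (v2) a change.
runs = [
    {"app_version": "v1", "groundedness": 0.4},
    {"app_version": "v1", "groundedness": 0.6},
    {"app_version": "v2", "groundedness": 0.8},
    {"app_version": "v2", "groundedness": 1.0},
]
summary = mean_scores_by_version(runs, ["groundedness"])
print(summary)
```

Keeping the dataset fixed across versions is what makes the per-version means comparable.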

Regression Check

Ensure that fixes for one metric didn't degrade another.
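A regression check could be sketched as a per-metric diff between a baseline and a candidate version. The 0.02 tolerance is an assumed noise margin, and the score dictionaries are illustrative.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: delta} for metrics that dropped by more than `tolerance`."""
    regressions = {}
    for metric, old_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            continue  # metric not evaluated on the candidate; cannot compare
        if new_score < old_score - tolerance:
            regressions[metric] = new_score - old_score
    return regressions


# Illustrative means: the fix raised groundedness but hurt coherence.
baseline = {"groundedness": 0.4, "coherence": 0.8}
candidate = {"groundedness": 0.7, "coherence": 0.6}
regressions = find_regressions(baseline, candidate)
for metric, delta in regressions.items():
    print(f"{metric} regressed by {abs(delta):.2f}")
```

An empty result means no tracked metric fell by more than the tolerance.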

Reference

Existing skills live in skills/ as structured Markdown files (SKILL.md) following spec version 0.1.0.

Difficulty

Medium

Metadata

Assignees

No one assigned

    Labels

    Agent Skills (New or improved agent skills for AI coding assistants), enhancement (New feature or request), help wanted (Extra attention is needed)
