[feat] Extend queues #3814

Open
mmabrouk wants to merge 21 commits into main from feature/annotation-queue-v2

Conversation

@mmabrouk
Member

@mmabrouk mmabrouk commented Feb 24, 2026

Summary

Design workspace for the annotation queue v2 feature.

  • context.md — background, problem statement, goals
  • prd.md — product requirements, user stories, acceptance criteria
  • rfc.md — technical RFC with three solution approaches (A: extend eval runs, B: new domain, C: metadata-based) — recommends C for v1
  • research.md — analysis of the existing EvaluationQueue backend implementation
  • research-human-eval-implementation.md — how human evaluation works today end-to-end (frontend components, state, API calls, annotation storage, backend service logic)
  • competitive-analysis.md — competitor metadata-based approach analysis

Related

Linear PRD: https://linear.app/agenta/document/prd-annotation-queues-b80788a78c9a



- context.md: background, current state analysis, problem statement
- prd.md: product requirements with 4 capabilities, user stories, acceptance criteria
- research.md: analysis of current EvaluationQueue implementation
- rfc.md: technical RFC with two solution options (extend runs vs new domain)
- competitive-analysis.md: anonymized analysis of competitor's approach
- rfc.md: added Solution C using metadata-based queues (no new tables)
- Updated recommendation: Solution C for v1 (1-2 weeks vs 4-5 weeks)

Key insight: annotations and review status can be stored as metadata
on existing items, with queues as filtered views rather than entities.
@vercel

vercel bot commented Feb 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agenta-documentation | Ready | Preview, Comment | Feb 27, 2026 9:21am |

Request Review

Proposes building annotation queues as a convenience layer over
existing EvaluationRun + EvaluationQueue entities. No new domain
entities — the problem is the interface, not the data model.

Covers: trace annotation, test set annotation, human+auto eval,
convenience API design, UI direction (view swap + inbox), phased
implementation plan.
**What happens behind the scenes:**

1. User selects traces in observability view, clicks "Send to review"
2. User picks evaluators (what to annotate) and optionally assigns people
Member Author

This is not the correct flow. The flow is that the user sends the trace to an annotation queue, or creates a new one. If they create a new one, they can configure it with evaluators, people, number of annotations, etc.

The convenience API auto-creates the data for that annotation queue (and also needs to delete it and its data when asked to)

Member Author

Updated the flow. Now it's:

  1. Select traces → "Send to annotation queue"
  2. Choose existing queue OR create new one (configure labels, assign people, set repeats)
  3. Convenience API auto-creates backing infrastructure

Also added deletion handling — the convenience API should clean up EvaluationQueue + Scenarios + Results + Run when a queue is deleted (but NOT the immutable OTel annotation spans).
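The deletion ordering described above can be sketched as follows. This is a minimal illustration, not the actual service code — the entity containers and function name are hypothetical, and real code would delete rows via the evaluations DAO inside one transaction:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical in-memory stand-in for the entities backing one annotation queue.
@dataclass
class QueueBackingData:
    queue_id: str
    run_id: str
    scenario_ids: List[str]
    result_ids: List[str]
    annotation_span_ids: List[str]  # immutable OTel spans — never deleted

def delete_annotation_queue(data: QueueBackingData) -> List[str]:
    """Return the ids the convenience API would delete, child-first.

    Results and scenarios go before the queue and run so no orphaned
    references remain; annotation OTel spans are intentionally untouched.
    """
    deleted: List[str] = []
    deleted.extend(data.result_ids)    # EvaluationResults
    deleted.extend(data.scenario_ids)  # EvaluationScenarios
    deleted.append(data.queue_id)      # EvaluationQueue
    deleted.append(data.run_id)        # EvaluationRun
    # data.annotation_span_ids are append-only OTel spans and are preserved
    return deleted
```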

- The run has no inputs (no testset, no query) — just annotation steps with `origin: "human"` for each evaluator
- One EvaluationScenario per selected trace, with the trace's `trace_id` stored as the invocation reference in the scenario (no separate invocation step needed)
- An EvaluationQueue linked to the run, with user assignments if specified
4. Annotator opens inbox → sees assigned traces → annotates → submits
Member Author

It would be alright to have the user see the annotation queues they are assigned to and not the inbox directly (which could go to v2)

Member Author

Agreed. Rewrote the section as "Annotation Queues Page" — a dedicated page showing queues assigned to the current user, with progress (X/Y done), labels, and source type. Each queue has an "Open" action.

The global inbox (flat list of pending items across queues) is explicitly marked as v2.

- `POST /preview/annotations/` creates the annotation OTel span (same as today)
- `PATCH /preview/evaluations/results/` links the annotation `trace_id` to the step result (same as today)
- The annotation is also visible on the trace span in observability (existing write-through via OTel links)

Member Author

How does the FE discover:

  • Which annotations have been done and which are still open?
  • How far are we from done?

Can the user edit their annotation after it's done?

Member Author

Added a "Frontend: tracking progress and status" section addressing all three questions:

  1. Per-item status: Each EvaluationResult has a status (PENDING/COMPLETED). FE queries results for the queue's scenarios to determine done vs open.
  2. Overall progress: Count of COMPLETED results vs total (scenarios × annotation steps). Convenience API exposes this on the queue detail endpoint.
  3. Editing after completion: Yes. Re-submitting creates a new annotation OTel span (append-only) and updates the EvaluationResult.trace_id to the latest. Previous annotations are preserved.
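Points 1 and 2 boil down to a small aggregation over result statuses. A sketch, assuming `results` is one entry per (scenario × annotation step) with a `status` of `"PENDING"` or `"COMPLETED"` — the exact payload shape of the queue detail endpoint is an assumption:

```python
from typing import Dict, List

def queue_progress(results: List[Dict]) -> Dict:
    """Derive per-queue progress from EvaluationResult statuses."""
    total = len(results)
    done = sum(1 for r in results if r["status"] == "COMPLETED")
    return {
        "done": done,
        "total": total,
        "is_completed": total > 0 and done == total,
    }
```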

- `PATCH /preview/evaluations/results/` links the annotation `trace_id` to the step result (same as today)
- The annotation is also visible on the trace span in observability (existing write-through via OTel links)

**Key design choice: evaluations without inputs.** The run has no input steps. The trace being annotated is referenced as the invocation in the scenario. This requires backend support for runs where only invocation references exist (no testset inputs).
Member Author

What needs to change for this to be possible? Let's run research, add a note in the appendix, and link it from here.

Member Author

Done! Added Appendix A: Evaluations Without Inputs — Technical Analysis to the RFC with detailed findings:

Good news: The data model already supports it — testcase_id and trace_id are both nullable, scenarios have no input fields.

What blocks it: 3 things:

  1. The start() gate requires query_steps or testset_steps — without either, the run is never dispatched
  2. The batch worker (evaluate_batch_testset) assumes testsets and would crash without them
  3. No trace-only worker exists

Recommended approach: Model a new worker after the live eval flow (evaluate_live_query) which already handles trace-based scenarios with testcase=None. Just swap the query-discovery step for a pre-provided list of trace_ids.

Estimate: ~2-3 days backend work.
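Blocker 1 above is essentially one relaxed condition. A sketch of what the loosened `start()` gate could look like — the field names here are illustrative, not the actual schema:

```python
def can_dispatch(run_steps: dict) -> bool:
    """Relaxed start() gate (sketch): dispatch when the run has query steps,
    testset steps, OR a pre-provided list of trace ids (the new trace-only
    annotation path). Today only the first two conditions exist.
    """
    return bool(
        run_steps.get("query_steps")
        or run_steps.get("testset_steps")
        or run_steps.get("trace_ids")
    )
```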

**What happens behind the scenes:**

1. User opens a test set, clicks "Annotate" (or "Send to review")
2. User picks evaluators (what to annotate) — or defines them inline:
Member Author

From a UX perspective, it does not make sense for the user to pick evaluators here; instead they should specify which labels they want in the annotation (we would offer defaults like correct_answer and judge_guidelines). The FE or BE should create evaluators based on that.

Member Author

Agreed — updated the flow. The user now specifies labels ("What do you want annotators to provide?") with sensible defaults like correct_answer, quality_rating, judge_guidelines. The FE/BE auto-creates a human evaluator with a matching JSON schema behind the scenes.

The user never sees or picks "evaluators" — they think in terms of labels and fields.
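The labels-to-evaluator translation could be as simple as the sketch below. The label kinds and the schema layout are assumptions for illustration — the actual human evaluator format may differ:

```python
from typing import Dict, List

# Illustrative mapping from user-facing label kinds to JSON-schema fragments.
LABEL_TYPES = {
    "text": {"type": "string"},
    "boolean": {"type": "boolean"},
    "rating_1_5": {"type": "integer", "minimum": 1, "maximum": 5},
}

def labels_to_schema(labels: List[Dict[str, str]]) -> Dict:
    """Build the JSON schema for the auto-created human evaluator."""
    return {
        "type": "object",
        "properties": {
            label["name"]: LABEL_TYPES[label["kind"]] for label in labels
        },
        "required": [label["name"] for label in labels],
    }
```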

**Key design choice: annotating ≠ modifying the test set.** The annotation step creates annotation traces (OTel spans). These reference the test cases but don't modify them. Writing back to the test set is a separate, explicit action that creates a new revision. This preserves test case immutability and versioning.

**Constraint:** Test cases are immutable today — changing content creates new IDs, and changes only stick when attached to a revision. The write-back step must respect this by creating a new revision, not mutating existing test cases.

Member Author

Given that we will be creating a queue for each test set review: what happens when they are done? Do we show them somewhere as finished, or hide them?

Member Author

Good question. Proposed approach:

Queue states: Active → Completed → (optionally) Archived

  • Active: Has pending items. Shown prominently on the Annotation Queues page.
  • Completed: All items annotated (100% progress). Still visible on the page but visually de-emphasized (e.g., grayed out, moved to a "Completed" section/tab).
  • Archived: User explicitly archives. Hidden from default view but accessible via a filter.

The state is derived from progress (not a manual status toggle): a queue is "completed" when all scenarios × annotation steps have COMPLETED results. No lifecycle state machine needed — it's just a computed property.
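Since the state is a computed property, it reduces to a few lines. A sketch, assuming `done`/`total` come from the progress counts and `archived` is the only stored flag:

```python
def queue_state(done: int, total: int, archived: bool) -> str:
    """Derive queue state — no stored lifecycle state machine.

    The explicit archive action wins; otherwise the state falls out
    of progress alone.
    """
    if archived:
        return "archived"
    if total > 0 and done == total:
        return "completed"
    return "active"
```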

For test set annotation queues specifically: the queue shows as completed, and the user can then click "Write back to test set" to create a new revision. The queue itself stays visible as a record of the annotation work done.

Will add this to the RFC's Key Design Decisions section.

- An EvaluationQueue with optional user assignments
4. Annotator works through rows → fills in labels → submits
5. On submit: same annotation creation + result linking as today
6. **Write-back step** (separate action): User clicks "Save annotations to test set" → creates a new test set revision with annotation values as new columns
Member Author

Who clicks on that? Is it the annotator? What if there are many? Is it part of the configuration of the queue so that it happens automatically on submit?

Would it be possible to have this same action for annotation queues for traces — adding the annotated traces as test cases? [The competitor offers a UX for this.]

Member Author

Three questions answered:

1. Who clicks write-back?
The queue creator/admin, not the annotator. When the queue is completed (or partially done), the admin goes to the queue detail view and clicks "Write back to test set." This is NOT per-annotator — it's a one-time action that collects all annotations and creates a new test set revision.

2. Can it be automatic?
Could be a queue config option: on_complete: "auto_write_back". When the last item is annotated, automatically create the new revision. But this adds complexity — for v1, let's keep it manual (explicit button). Can revisit for v2.

3. Can annotated traces be saved as test cases?
Yes! This is a great feature (competitor does offer this). The flow would be:

  • User annotates traces in a queue
  • On completion, clicks "Save as test set" → creates a new test set where each trace becomes a test case row
  • The trace's inputs become test case inputs, the trace's outputs become expected outputs, and the annotations become additional columns

This is essentially the same write-back mechanism but creating a new test set instead of updating an existing one. The convenience API could support both: write_back_target: "existing_testset" | "new_testset".

Will add this to the RFC.

- "What do you want to add?" → "Expected answer (text), Difficulty (1-5)"
- Behind the scenes: a human evaluator is auto-created with a JSON schema matching these fields
3. **Convenience API** auto-creates:
- An EvaluationRun linked to the test set revision, with annotation steps for each evaluator
Member Author

Is this feasible with today's evaluation run interface? Would it break something?

Member Author

Yes, this is feasible with today's interface. The EvaluationRunData stores steps as JSONB, and testset input steps already reference testset_revision in the step references. Creating an EvaluationRun linked to a testset revision works today — the batch worker (evaluate_batch_testset) already fetches testcases from the revision and creates scenarios.

The part that doesn't exist is creating such a run purely for annotation (no auto evaluators, no app invocation). That requires the changes described in Appendix A — mainly the start() gate and a trace-less worker path.


**What happens behind the scenes:**

1. User creates evaluation run with both auto and human evaluators (existing flow)
Member Author

How can we make the flow clear? The UX obvious and the feature discoverable?

Member Author

Updated the RFC with concrete discoverability mechanisms:

  1. Eval run detail view: Shows a prominent banner when human annotation is pending — "This evaluation has X human annotation tasks. [Go to annotation queue]"
  2. Annotation Queues page: Lists all queues including ones auto-created from eval runs, with the run name as context
  3. Eval run list: Runs with pending human annotation should show a distinct status (not just ERRORS as today)

The key insight: the orchestrator should handle mixed runs natively (skip human steps, seed as PENDING, create queue). The user doesn't need to do anything special — creating an eval run with human evaluators automatically surfaces annotation tasks.

**What happens behind the scenes:**

1. User creates evaluation run with both auto and human evaluators (existing flow)
2. Auto evaluators execute immediately (existing flow)
Member Author

What happens when an evaluation run is created as of now (using the automatic evaluation orchestrator) with human evaluator steps? Does it get stuck? Does it ignore them?

Member Author

Researched this thoroughly. It fails.

When the orchestrator encounters human evaluator steps, it attempts to invoke them via workflows_service.invoke_workflow(). Human evaluators have no registered handler (no uri — only a JSON schema in data.service.format), so the invocation raises InvalidInterfaceURIV0Error.

The result:

  • Human evaluator steps are recorded with status=FAILURE and an error payload
  • The scenario is marked as ERRORS
  • The run is marked as ERRORS (not fully failed since auto evaluators may have succeeded)
  • No EvaluationQueue is created — queue creation is not part of the orchestration pipeline at all
  • There are zero is_human checks anywhere in the evaluation orchestration code

So for mixed auto+human runs, the auto evaluators succeed and the human ones fail. The has_human flag is computed and stored on run flags, but never acted upon.

What's needed: The orchestrator should check step.origin == "human" before invoking, skip invocation for human steps, seed their results as PENDING, and optionally create an EvaluationQueue for those steps.
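The needed check can be sketched as below. The step/field names are illustrative; real code would call workflows_service.invoke_workflow for the auto steps and create an EvaluationQueue when any human step exists:

```python
from typing import Dict, List

def process_steps(steps: List[Dict]) -> Dict:
    """Proposed orchestrator behavior (sketch): human steps are skipped
    and seeded as PENDING instead of being invoked and failing."""
    results = []
    has_human = False
    for step in steps:
        if step.get("origin") == "human":
            has_human = True
            results.append({"step": step["key"], "status": "PENDING"})
        else:
            # placeholder for the existing auto-evaluator invocation
            results.append({"step": step["key"], "status": "SUCCESS"})
    # has_human signals that an EvaluationQueue should be created for the run
    return {"results": results, "create_queue": has_human}
```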


1. User creates evaluation run with both auto and human evaluators (existing flow)
2. Auto evaluators execute immediately (existing flow)
3. Human evaluator steps → an EvaluationQueue is auto-created for the run
Member Author

The evaluation run orchestrator is responsible for creating the steps. Does this mean the convenience API would schedule an evaluation run and then wait for the steps to be created, so that it can create an evaluation queue referencing the related human evaluation steps? Or do we need to change the automatic evaluation orchestration to include this itself (it creates the steps, finds the human evaluators, then creates the queue based on them)?

Side note: how does the online eval work in this case? Maybe there is something to learn from its structure.

Member Author

Great question. Updated the RFC — the evaluation orchestrator creates the queue, not the convenience API. Here's the reasoning:

The orchestrator should be modified to:

  1. Check step.origin == "human" before invoking
  2. Skip invocation for human steps, seed results as PENDING
  3. After processing all steps, detect has_human and create the EvaluationQueue

This is cleaner than having the convenience API schedule a run and wait because:

  • The orchestrator already knows which steps are human and which scenarios exist
  • No polling/waiting needed
  • Follows the structure of the online eval flow

Re: online eval structure — good call. The live eval worker (evaluate_live_query in live.py) is relevant: it creates scenarios from traces (not testcases), fetches trace data, and runs evaluators per scenario. The trace-only annotation worker should be modeled after this pattern. The live flow already handles testcase=None, inputs=None — exactly what we need for annotation-only runs.

],

// Items to annotate (one of these)
"source": {
Member Author

This is a very weird design, imo. Why is it needed?

For instance, why is run_id needed? This layer seems unnecessary for run_id, since the orchestration layer for evaluation would probably handle creating the annotation queue when there are human evaluations (cc @junaway @jp-agenta — question mark).

Very likely the only thing needed is whether the source is traces or test cases. In the testset case, we need a revision id to save back (although if we do it case by case, we could save a new version of that test case from wherever it came from [not sure a test case id tells us where it is from — probably not]).

The trace_ids for initialization make sense as optional.

Member Author

You're right — completely reworked the API design.

Dropped run_id as a source type. For eval runs, the orchestrator creates the queue directly. The convenience API only handles the two explicit cases: traces and testset.

Source types are now just traces and testset. The trace_ids field is optional (for initialization — you can add items later).

Also changed from evaluators to labels as the primary interface — users define what they want ("correctness: boolean", "quality: 1-5") and we auto-create evaluators behind the scenes. Can also reference existing evaluator slugs for power users.

```
POST /preview/annotation-queues/{queue_id}/items
{
"trace_ids": ["new-trace-1", "new-trace-2"]
Member Author

What about test cases? Do we validate?

Member Author

Updated the "Add Items" endpoint to handle both traces and test cases. For testset-sourced queues, you can add specific testcase_ids.

Validation: Yes, the endpoint validates that the item type matches the queue's source type. Can't add traces to a testset-sourced queue or vice versa.
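That validation is a simple source-type match. A sketch with illustrative field names — not the actual endpoint code:

```python
def validate_add_items(queue_source: str, payload: dict) -> None:
    """Reject items whose type does not match the queue's source type."""
    if queue_source == "traces" and payload.get("testcase_ids"):
        raise ValueError("cannot add test cases to a trace-sourced queue")
    if queue_source == "testset" and payload.get("trace_ids"):
        raise ValueError("cannot add traces to a testset-sourced queue")
```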

### Write Back to Test Set

```
POST /preview/annotation-queues/{queue_id}/write-back
Member Author

very weird

Member Author

Agreed — completely reworked. Renamed to /export and it now supports two targets:

  1. testset_revision: For testset-sourced queues. Creates a new revision with annotation values as new columns. User provides a column_mapping to control naming.
  2. new_testset: For trace-sourced queues. Creates a new test set from annotated traces (trace inputs → test case inputs, annotations → columns). This is the "save annotated traces as test cases" feature.

The export is triggered by the queue admin, not individual annotators.
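A hypothetical request shape for the two targets — the field names below are illustrative, not a confirmed schema:

```
POST /preview/annotation-queues/{queue_id}/export
{
  // testset-sourced queue → new revision on the same test set
  "target": "testset_revision",
  "column_mapping": { "correct_answer": "expected_answer" }
}

POST /preview/annotation-queues/{queue_id}/export
{
  // trace-sourced queue → brand-new test set from annotated traces
  "target": "new_testset",
  "testset_name": "annotated-traces"
}
```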


Creates a new test set revision with annotation values as new columns.

---
Member Author

There is missing discussion of the frontend and how it would interact with this in each place.

Member Author

Rewrote the entire UI section with explicit frontend interaction details for each use case:

  1. Observability: Select traces → "Send to annotation queue" modal → POST to convenience API → queue appears on Annotation Queues page
  2. Test set: "Send to annotation queue" button → configure labels → POST to convenience API → queue appears → on completion, admin exports to test set
  3. Eval run: Orchestrator auto-creates queue → banner on eval run detail: "This evaluation has human annotation tasks" → queue appears on page → results flow back to eval table

Also specified that the annotation view is the same for all source types — one view, multiple data types.

Member Author

This is not all I meant. Basically: how does the annotation view work right now (in human evaluation) vs. how would it work here, and what needs to change?

Member Author

Added Appendix B: Annotation View — Current State vs Proposed Changes to the RFC with a thorough comparison.

How it works today (2 separate contexts):

  1. Trace drawer (AnnotateDrawer): 400px side drawer with ONLY the annotation form. No inputs/outputs shown. User selects evaluators manually. Ad-hoc annotations.

  2. Eval run focus view (SingleScenarioViewerPOC + ScenarioAnnotationPanel): Full-page layout — inputs+outputs on left (7/12), annotation panel on right (5/12). Has prev/next navigation via ScenarioNavigator. This is the closest to what we need.

What does NOT exist today:

  • No assignment (all scenarios shown to all users, no queue API calls)
  • No progress tracking (no X/Y counter anywhere)
  • No auto-advance after annotation
  • No unified view for traces + testcases (separate code paths)

What needs to change (built on top of Focus View):

  1. New queue page + queue list page (~2 days)
  2. Adapt Focus View for queue context — fetch assigned items, render trace data or testcase data uniformly (~3 days)
  3. Assignment integration (~1 day)
  4. Progress tracking (~1 day)
  5. Navigation improvements: auto-advance, keyboard shortcuts (~1 day)

Total: ~8 days FE work


### Annotation Mode (View Swap)

Instead of a separate "annotation queue" page, the annotation experience lives **inside existing views**. The user switches to "annotation mode" on the current view:
Member Author

This is mostly orthogonal to this RFC. Having an annotation mode view does not require annotation queues, since it is local / main-user-only and stateless.

Member Author

Agreed — moved annotation mode (view swap) to a separate note explicitly marked as orthogonal and out of scope for this RFC. It's local/stateless and doesn't require queue infrastructure. Could be built in parallel as an independent feature.

- Same as today but with actual assignment from the queue
- Annotator only sees their assigned scenarios

This approach avoids creating a separate "annotation queue" page. The queue is a background concept — the user works inside the views they already know.
Member Author

We will have, and need to have, an annotation queue page.

Member Author

Agreed! Rewrote the section — it's now a dedicated "Annotation Queues" page in the sidebar nav, listing all queues assigned to the user with progress, labels, and source info. This is a first-class page, not a hidden concept.

└─────────────────────────────────────────────────────────────────────┘
```

Clicking "Open" navigates to the appropriate view (observability for trace queues, test set view for test set queues, eval run details for eval queues) in annotation mode, filtered to the user's assigned items.
Member Author

No — the annotation view is the same for all three. It is an extension of what we have now, just allowing multiple types of data to be shown.

Member Author

Agreed — updated. The annotation view is now explicitly described as the same for all three source types. It's an extension of the existing annotation drawer/eval table that renders trace data and testcase data uniformly (both have inputs/outputs). The annotation form on the side is always the same, driven by the evaluator's JSON schema.

One view, multiple data types.


## Open Questions

1. **Evaluations without inputs:** How much backend work is needed to support runs with no input steps? Are there assumptions in the scenario/result seeding that require inputs?
Member Author

Research this and answer.

Member Author

Researched and answered — see Appendix A in the RFC.

TL;DR: The data model is ready (all fields nullable), but 3 blocking changes are needed: relax the start() gate, add trace_ids to SimpleEvaluationData, and implement a new trace-batch worker modeled after the existing live eval flow. ~2-3 days of backend work. No DB migrations needed.

Major changes based on review feedback:
- Fix Use Case 1 flow: queue-first (send to queue), not evaluator-first
- Add deletion handling for annotation queues
- Add FE progress/status tracking details (per-item, overall, editing)
- Add Appendix A: technical analysis of evaluations without inputs
- Rewrite API: drop run_id source, labels-first (not evaluators), /export endpoint
- Rewrite UI: dedicated Annotation Queues page, unified annotation view for all sources
- Fix Use Case 3: orchestrator creates queue, documents current broken behavior
- Separate annotation mode (orthogonal) from annotation queues
- Add queue lifecycle (Active → Completed → Archived)
- Add traces→testcases export capability
@jp-agenta jp-agenta changed the title docs: annotation queue v2 design documents [feat] Extend queues Feb 26, 2026
@junaway junaway marked this pull request as ready for review February 27, 2026 08:35
Copilot AI review requested due to automatic review settings February 27, 2026 08:35
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 27, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR extends the existing evaluation queue infrastructure to support “ad-hoc” / simplified annotation-queue style workflows (plus accompanying design docs for Annotation Queue v2).

Changes:

  • Adds an is_adhoc run flag and introduces “Simple Queues” (API/service + worker tasks) built on top of EvaluationRun + EvaluationQueue.
  • Extends queue assignment/partitioning with batch_size / batch_offset, plus a denormalized evaluation_queues.user_ids column for assignee filtering.
  • Adds a design workspace under docs/design/annotation-queue-v2/ (RFCs, PRD, research notes).

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
| --- | --- |
| web/oss/src/lib/hooks/usePreviewEvaluations/index.ts | Adds is_adhoc to run-flags filter shape for FE querying. |
| web/oss/src/components/EvaluationRunsTablePOC/constants.ts | Adds is_adhoc flag key + label for UI filtering/display. |
| sdk/agenta/client/backend/types/evaluation_queue_data.py | Adds batch_size / batch_offset to SDK queue data type. |
| docs/design/annotation-queue-v2/rfc.md | Adds initial RFC with 3 approaches and recommendation. |
| docs/design/annotation-queue-v2/rfc-v2.md | Adds updated RFC focusing on a convenience layer over evaluation entities. |
| docs/design/annotation-queue-v2/*.md | Adds supporting research/PRD/context/analysis documents. |
| docs/design/annotation-queue-v2/README.md | Adds an index/entrypoint for the design workspace. |
| api/oss/tests/pytest/unit/test_evaluation_queue_assignment_utils.py | Adds/updates unit tests for queue assignment behavior (batch size/offset semantics). |
| api/oss/src/tasks/taskiq/evaluations/worker.py | Registers new batch tasks (invocation/traces/testcases) and makes evaluate_live_query timestamps optional. |
| api/oss/src/dbs/postgres/evaluations/utils.py | Adjusts run-flag inference for ad-hoc runs (including step-key based inference). |
| api/oss/src/dbs/postgres/evaluations/dbes.py | Adds GIN index for evaluation_queues.user_ids. |
| api/oss/src/dbs/postgres/evaluations/dbas.py | Adds user_ids ARRAY(UUID) column to EvaluationQueue DBA. |
| api/oss/src/dbs/postgres/evaluations/dao.py | Populates/queries user_ids for queue assignee filtering; adds flatten helper. |
| api/oss/src/core/evaluations/utils.py | Updates scenario assignment to support batch_size / batch_offset. |
| api/oss/src/core/evaluations/types.py | Adds is_adhoc, queue batch_* fields + validators, and introduces SimpleQueue DTOs. |
| api/oss/src/core/evaluations/tasks/legacy.py | Adds new batch evaluation flows for invocation-only and ad-hoc trace/testcase ingestion. |
| api/oss/src/core/evaluations/service.py | Wires batch_* into assignment, adds batch dispatch helpers, and introduces SimpleQueuesService. |
| api/oss/src/apis/fastapi/evaluations/router.py | Adds /preview/simple/queues endpoints (create/query/fetch/add items/query scenarios). |
| api/oss/src/apis/fastapi/evaluations/models.py | Adds request/response models for SimpleQueues endpoints. |
| api/oss/databases/postgres/migrations/core/versions/*.py | Adds migrations for is_adhoc in run flags and user_ids on evaluation_queues (OSS). |
| api/ee/databases/postgres/migrations/core/versions/*.py | Mirrors the same migrations for EE. |
| api/entrypoints/worker_evaluations.py | Injects TestcasesService into the evaluations worker. |
| api/entrypoints/routers.py | Wires SimpleQueuesService and includes the new router under /preview/simple/queues. |


Comment on lines +1972 to +1978
    root_span = list(trace.spans.values())[0]
    if isinstance(root_span, list):
        scenario_status = EvaluationStatus.ERRORS
        run_has_errors = True
    else:
        query_span_id = root_span.span_id
        _trace = trace.model_dump(mode="json", exclude_none=True)
Copilot AI Feb 27, 2026

root_span = list(trace.spans.values())[0] assumes the first entry in the OTelNestedSpans dict is the root span. Since spans is a dict (Dict[str, Union[OTelSpan, List[OTelSpan]]]), key/value order isn’t a reliable way to identify the root, and this can link annotations to the wrong span. Consider determining the root by parent_id is None (after flattening) or by using a tracing helper that returns the root span explicitly.

Suggested change
-    root_span = list(trace.spans.values())[0]
-    if isinstance(root_span, list):
-        scenario_status = EvaluationStatus.ERRORS
-        run_has_errors = True
-    else:
-        query_span_id = root_span.span_id
-        _trace = trace.model_dump(mode="json", exclude_none=True)
+    # Flatten all spans (values may be a single span or a list of spans)
+    all_spans = []
+    for span_value in trace.spans.values():
+        if isinstance(span_value, list):
+            all_spans.extend(span_value)
+        else:
+            all_spans.append(span_value)
+    # Determine the root span by parent_id being None
+    root_span = None
+    for span in all_spans:
+        # Use getattr for safety in case some span objects lack parent_id
+        if getattr(span, "parent_id", None) is None:
+            root_span = span
+            break
+    if root_span is None:
+        scenario_status = EvaluationStatus.ERRORS
+        run_has_errors = True
+    else:
+        query_span_id = root_span.span_id

Comment on lines +2194 to +2214
if human_step_keys:
    existing_queues = await evaluations_service.query_queues(
        project_id=project_id,
        queue=EvaluationQueueQuery(
            run_id=run_id,
        ),
    )
    has_run_queue = any(queue.run_id == run_id for queue in existing_queues)
    if not has_run_queue:
        await evaluations_service.create_queue(
            project_id=project_id,
            user_id=user_id,
            queue=EvaluationQueueCreate(
                run_id=run_id,
                status=EvaluationStatus.RUNNING,
                data=EvaluationQueueData(
                    scenario_ids=[s.id for s in scenarios if s.id],
                    step_keys=human_step_keys,
                ),
            ),
        )
Copilot AI Feb 27, 2026

When human_step_keys is present, the code only creates an EvaluationQueue if none exists. If a queue already exists with a data.scenario_ids subset filter, newly created scenarios won’t be added to that list and therefore won’t be returned by fetch_queue_scenarios() (which respects queue.data.scenario_ids). Consider updating the existing queue’s data.scenario_ids to include the new scenario IDs (or ensuring queues created for ad-hoc batching never rely on scenario_ids).
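A minimal sketch of the merge the comment suggests, folding newly created scenario IDs into an existing queue's subset filter; the `Queue`/`QueueData` stand-ins and the `merge_scenario_ids` helper are illustrative, not the codebase's real DTOs:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal stand-ins for the real DTOs (assumptions, not the actual classes).
@dataclass
class QueueData:
    scenario_ids: Optional[List[str]] = None

@dataclass
class Queue:
    run_id: str
    data: QueueData = field(default_factory=QueueData)

def merge_scenario_ids(queue: Queue, new_ids: List[str]) -> List[str]:
    """Union the existing subset filter with newly created scenario IDs,
    preserving order and dropping duplicates."""
    existing = queue.data.scenario_ids or []
    return list(dict.fromkeys([*existing, *new_ids]))

queue = Queue(run_id="r1", data=QueueData(scenario_ids=["s1", "s2"]))
queue.data.scenario_ids = merge_scenario_ids(queue, ["s2", "s3"])  # ["s1", "s2", "s3"]
```

The editing call that would persist the merged list back to the queue is left out, since the service method for that is not shown in this diff.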


    except Exception as e:  # pylint: disable=broad-exception-caught
        log.error(
            f"An error occurred during trace batch evaluation: {e}",
Copilot AI Feb 27, 2026

The exception log message in _evaluate_batch_items() hardcodes "trace batch evaluation", but this helper is used for both trace and testcase batches. This makes logs misleading when debugging failures for testcase-backed queues. Consider using a generic message (e.g., "batch items evaluation") or include which of trace_ids/testcase_ids was provided.

Suggested change
-            f"An error occurred during trace batch evaluation: {e}",
+            f"An error occurred during batch items evaluation: {e}",

Comment on lines +96 to 102
    step_key = (_step.key or "").lower()

    if "query" in step_key:
        flags.has_queries = True
    if "testset" in step_key:
        flags.has_testsets = True

Copilot AI Feb 27, 2026

_make_run_flags() infers has_queries/has_testsets for ad-hoc runs by checking substrings in the step key when references is empty. This creates a brittle coupling where renaming the step key (e.g., from query-direct) silently changes flag computation and downstream behavior (like kind detection). Prefer making the kind explicit in references or meta, or storing a dedicated field/flag on the run data for ad-hoc queue kind.

Suggested change
-    step_key = (_step.key or "").lower()
-    if "query" in step_key:
-        flags.has_queries = True
-    if "testset" in step_key:
-        flags.has_testsets = True
+    # Prefer explicit kind information (e.g., from step metadata)
+    inferred_from_meta = False
+    step_meta = getattr(_step, "meta", None) or {}
+    if isinstance(step_meta, dict):
+        step_kind = step_meta.get("kind")
+        # Allow kind to be a single string or an iterable of strings
+        kinds_to_check = []
+        if isinstance(step_kind, str):
+            kinds_to_check = [step_kind]
+        elif isinstance(step_kind, (list, tuple, set)):
+            kinds_to_check = list(step_kind)
+        for kind_item in kinds_to_check:
+            kind_str = str(kind_item).lower()
+            if "query" in kind_str:
+                flags.has_queries = True
+                inferred_from_meta = True
+            if "testset" in kind_str:
+                flags.has_testsets = True
+                inferred_from_meta = True
+    # Fallback to step key substring heuristics only if meta did not specify a kind
+    if not inferred_from_meta:
+        step_key = (_step.key or "").lower()
+        if "query" in step_key:
+            flags.has_queries = True
+        if "testset" in step_key:
+            flags.has_testsets = True

Comment on lines 2706 to +2717
) -> EvaluationRunFlags:
    return EvaluationRunFlags(
        is_closed=is_closed or False,
        is_live=is_live or False,
        is_active=is_active or False,
        is_adhoc=is_adhoc or False,
        has_queries=has_queries or False,
        has_testsets=has_testsets or False,
        has_evaluators=has_evaluators or False,
        has_custom=has_custom or False,
        has_human=has_human or False,
        has_auto=has_auto or False,
Copilot AI Feb 27, 2026

_make_evaluation_run_flags() coerces None to False (e.g., is_closed=is_closed or False). Because _make_evaluation_run_query() uses this helper, omitted flag filters end up being applied as false in flags.contains(...), which can silently over-filter query results. Consider building an EvaluationRunQueryFlags (optional fields) for query paths and serializing with exclude_none=True so only explicitly provided filters are applied.

Suggested change
-) -> EvaluationRunFlags:
-    return EvaluationRunFlags(
-        is_closed=is_closed or False,
-        is_live=is_live or False,
-        is_active=is_active or False,
-        is_adhoc=is_adhoc or False,
-        has_queries=has_queries or False,
-        has_testsets=has_testsets or False,
-        has_evaluators=has_evaluators or False,
-        has_custom=has_custom or False,
-        has_human=has_human or False,
-        has_auto=has_auto or False,
+) -> EvaluationRunQueryFlags:
+    return EvaluationRunQueryFlags(
+        is_closed=is_closed,
+        is_live=is_live,
+        is_active=is_active,
+        is_adhoc=is_adhoc,
+        has_queries=has_queries,
+        has_testsets=has_testsets,
+        has_evaluators=has_evaluators,
+        has_custom=has_custom,
+        has_human=has_human,
+        has_auto=has_auto,
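The Optional-flags idea can be sketched with plain dataclasses; `RunQueryFlags` and `to_filter` are illustrative names standing in for a Pydantic model dumped with `exclude_none=True`:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Query-side flags: every field Optional, so "not provided" is distinguishable
# from "filter on False". (Illustrative names, not the real codebase types.)
@dataclass
class RunQueryFlags:
    is_closed: Optional[bool] = None
    is_live: Optional[bool] = None
    is_adhoc: Optional[bool] = None

def to_filter(flags: RunQueryFlags) -> dict:
    # Equivalent of Pydantic's model_dump(exclude_none=True):
    # only explicitly provided flags survive into the query filter.
    return {k: v for k, v in asdict(flags).items() if v is not None}

to_filter(RunQueryFlags(is_adhoc=True))  # {'is_adhoc': True}, not is_closed=False etc.
```

With this shape, omitting a flag in the request leaves it unconstrained instead of silently matching `False` in `flags.contains(...)`.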

Comment on lines +1948 to +1982
        has_query_steps = bool(_evaluation.data.query_steps)
        has_testset_steps = bool(_evaluation.data.testset_steps)
        has_application_steps = bool(_evaluation.data.application_steps)
        has_evaluator_steps = bool(_evaluation.data.evaluator_steps)

        if has_testset_steps and has_application_steps and has_evaluator_steps:
            await self.evaluations_worker.evaluate_batch_testset.kiq(
                project_id=project_id,
                user_id=user_id,
                #
                run_id=run.id,
            )

-       elif _evaluation.data.testset_steps:
-           await self.evaluations_worker.evaluate_batch_testset.kiq(
+       elif (
+           has_testset_steps
+           and has_application_steps
+           and not has_evaluator_steps
+           and not has_query_steps
+       ):
+           await self.evaluations_worker.evaluate_batch_invocation.kiq(
                project_id=project_id,
                user_id=user_id,
                #
                run_id=run.id,
            )

        else:
            log.warning(
                "[EVAL] [start] [skip] unsupported non-live run topology",
                run_id=run.id,
                has_query_steps=has_query_steps,
                has_testset_steps=has_testset_steps,
                has_application_steps=has_application_steps,
                has_evaluator_steps=has_evaluator_steps,
            )
Copilot AI Feb 27, 2026

start() no longer dispatches any worker task for non-live runs that have query_steps (and no testset steps). Those runs now fall into the "unsupported non-live run topology" branch and will never execute. If query-based (non-live) evaluations are still supported, add an explicit dispatch path (e.g., evaluate_live_query or a dedicated batch-query task) or reject such runs earlier with a clear error.
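The missing branch can be made explicit with a pure dispatch function; the task names here (including the dedicated `evaluate_batch_query`) are hypothetical, not confirmed worker tasks:

```python
def select_batch_task(
    has_query: bool,
    has_testset: bool,
    has_application: bool,
    has_evaluator: bool,
) -> str:
    """Pick a worker task for a non-live run, failing loudly instead of
    silently logging a warning on unsupported topologies.
    Task names are illustrative."""
    if has_testset and has_application and has_evaluator:
        return "evaluate_batch_testset"
    if has_testset and has_application and not has_evaluator and not has_query:
        return "evaluate_batch_invocation"
    if has_query and not has_testset:
        # Previously unreachable path: non-live query runs need their own task.
        return "evaluate_batch_query"  # hypothetical dedicated batch-query task
    raise ValueError("unsupported non-live run topology")
```

Raising here (or at run creation) surfaces the unsupported topology to the caller instead of leaving the run permanently pending.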


repeats: Optional[int] = None

assignments: Optional[Union[List[List[UUID]], List[UUID]]] = None
Copilot AI Feb 27, 2026

SimpleQueueData.assignments is typed as Union[List[List[UUID]], List[UUID]], which is ambiguous for JSON payloads (a list of UUID strings can match both shapes depending on validation/union ordering). This can lead to inconsistent parsing between clients and makes the API contract unclear. Prefer a single canonical shape (e.g., always List[List[UUID]], where the single-repeat case is a 1-element outer list) or introduce an explicit discriminator field.

Suggested change
-    assignments: Optional[Union[List[List[UUID]], List[UUID]]] = None
+    # assignments is always a list of lists; a single-repeat case should use a 1-element outer list
+    assignments: Optional[List[List[UUID]]] = None

Comment on lines +3577 to +3595
def _normalize_assignments(
    self,
    *,
    assignments: Optional[List[List[UUID]] | List[UUID]],
) -> Optional[List[List[UUID]]]:
    if assignments is None:
        return None

    if len(assignments) == 0:
        return None

    first_item = assignments[0]
    if isinstance(first_item, list):
        return [
            [UUID(str(user_id)) for user_id in repeat_user_ids]
            for repeat_user_ids in assignments
        ]

    return [[UUID(str(user_id)) for user_id in assignments]]
Copilot AI Feb 27, 2026

_normalize_assignments() infers whether assignments is 1D vs 2D by inspecting assignments[0]. With the current Union[List[List[UUID]], List[UUID]] API type, this inference is fragile (and can misbehave if deserialization produces unexpected shapes). Consider normalizing at the schema level (single accepted type) and validating repeats/shape explicitly rather than relying on runtime type checks.

devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@github-actions
Contributor

github-actions bot commented Feb 27, 2026

Railway Preview Environment

Image tag pr-3814-b8b2f65
Status Failed
Logs View workflow run

Updated at 2026-02-27T09:23:46.471Z

Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 12 additional findings in Devin Review.

Open in Devin Review


# revision identifiers, used by Alembic.
revision: str = "d7e8f9a0b1c2"
down_revision: Union[str, None] = "c2d3e4f5a6b7"
Contributor

🔴 Alembic migration d7e8f9a0b1c2 forks the migration chain (duplicate down_revision)

The new migration d7e8f9a0b1c2_add_is_adhoc_to_evaluation_run_flags.py declares down_revision = "c2d3e4f5a6b7". However, the pre-existing migration e5f6a1b2c3d4_add_tool_connections_table.py already has down_revision = "c2d3e4f5a6b7" (confirmed by grep). This creates two branches from the same parent revision in the Alembic migration graph.

Root Cause and Impact

Alembic expects a linear chain where each revision has exactly one child (unless explicit branch labels are used). When two migrations declare the same down_revision, Alembic detects "multiple heads" and alembic upgrade head fails with:

alembic.util.exc.CommandError: Multiple head revisions are present; please specify a specific target revision

This blocks all database migrations in both OSS and EE deployments.

The fix is to set down_revision of d7e8f9a0b1c2 to the actual current head of the migration chain (which should be e5f6a1b2c3d4 or whichever migration is currently at the tip), not the grandparent c2d3e4f5a6b7.

The same issue exists in the EE mirror at api/ee/databases/postgres/migrations/core/versions/d7e8f9a0b1c2_add_is_adhoc_to_evaluation_run_flags.py.

Prompt for agents
In both api/oss/databases/postgres/migrations/core/versions/d7e8f9a0b1c2_add_is_adhoc_to_evaluation_run_flags.py (line 15) and api/ee/databases/postgres/migrations/core/versions/d7e8f9a0b1c2_add_is_adhoc_to_evaluation_run_flags.py (line 15), change the down_revision from "c2d3e4f5a6b7" to the actual current head of the migration chain. Run `alembic heads` to find the current head revision ID. It is likely "e5f6a1b2c3d4" (from add_tool_connections_table) or a later migration. Update both files to point to that revision. Then verify with `alembic check` or `alembic heads` that only a single head exists.
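The multiple-heads failure mode is easy to see in miniature: a head is any revision that no other revision lists as its `down_revision`, so two migrations sharing one parent yield two heads. A toy check (not Alembic's actual implementation):

```python
from typing import Dict, Optional, Set

def find_heads(revisions: Dict[str, Optional[str]]) -> Set[str]:
    """Map of revision -> down_revision; heads are revisions nobody points to."""
    parents = set(revisions.values())
    return {rev for rev in revisions if rev not in parents}

chain = {
    "c2d3e4f5a6b7": None,            # shared parent
    "e5f6a1b2c3d4": "c2d3e4f5a6b7",  # add_tool_connections_table
    "d7e8f9a0b1c2": "c2d3e4f5a6b7",  # add_is_adhoc (forks the chain)
}
find_heads(chain)  # {'e5f6a1b2c3d4', 'd7e8f9a0b1c2'} -> two heads, upgrade fails
```

Re-pointing `d7e8f9a0b1c2` at the true head restores a single-head linear chain.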
Open in Devin Review


Member Author

@mmabrouk mmabrouk left a comment

Thanks @jp-agenta, added comments

class SimpleQueueData(BaseModel):
    kind: SimpleQueueKind

    evaluator_steps: Optional[Target] = None
Member Author

The nomenclature here is a bit weird. Why steps? Steps made sense in the evaluation_run context, since there was a sort of sequence. Here it's just a list of evaluators. The naming of this public interface is made to fit the name of the internal interface, which might be confusing to users of this interface.


repeats: Optional[int] = None

assignments: Optional[Union[List[List[UUID]], List[UUID]]] = None
Member Author

I guess these are user assignments? I needed to read the code to understand that this is project and user assignment in a certain order. I wonder how this information could be surfaced in openapi.json/our docs. One option is a docstring, but creating explicit objects for each would also help.
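The "creating objects for each" idea might look like this; `RepeatAssignment`/`SimpleQueueAssignments` are hypothetical names, shown with dataclasses rather than the codebase's Pydantic models:

```python
from dataclasses import dataclass, field
from typing import List
from uuid import UUID

@dataclass
class RepeatAssignment:
    """Users assigned to one repeat of the queue, in order."""
    user_ids: List[UUID] = field(default_factory=list)

@dataclass
class SimpleQueueAssignments:
    """One entry per repeat; the schema documents itself in openapi.json."""
    repeats: List[RepeatAssignment] = field(default_factory=list)

# The single-repeat case is just a 1-element outer list:
assignments = SimpleQueueAssignments(
    repeats=[RepeatAssignment(user_ids=[UUID(int=1), UUID(int=2)])]
)
```

An explicit wrapper type removes the 1D-vs-2D ambiguity of the `Union` entirely, at the cost of a slightly more verbose payload.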

queue_ids: Optional[List[UUID]] = None


class SimpleQueueScenariosQuery(Identifier):
Member Author

I think there should be an option to query scenarios by status too (completed or not). Otherwise, if a user creates a queue and uses it over a long period, the result set will eventually become huge.
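A sketch of the requested filter; the `status` field and its values are assumptions about what the scenarios query could expose:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Scenario:
    id: str
    status: str  # e.g. "pending" | "completed" (illustrative values)

def query_scenarios(
    scenarios: List[Scenario],
    status: Optional[str] = None,
) -> List[Scenario]:
    """Return all scenarios when status is omitted, else only matches."""
    if status is None:
        return scenarios
    return [s for s in scenarios if s.status == status]

pending = query_scenarios(
    [Scenario("a", "pending"), Scenario("b", "completed")],
    status="pending",
)  # only scenario "a"
```

In the real endpoint this would translate to a WHERE clause rather than in-memory filtering, but the contract is the same: an omitted status returns everything.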

name=queue.name,
description=queue.description,
#
flags=EvaluationRunFlags(
Member Author

We need some flag to hide the evaluation run from the evaluation run list. The queue for test sets is not expected to be shown as an evaluation run.

    description=queue.description,
    #
    flags=EvaluationQueueFlags(
        is_sequential=False,
Member Author

why False?

        run_id=run_id,
        scenario_id=scenario.id,
    ),
)
Member Author

The logic here is a bit convoluted in my opinion. It seems this function does everything and handles every case: when it is called in the flow where the user is creating a simple annotation queue, it creates the run and then checks again whether a queue exists, and when they are adding traces it creates the run.

I would rather have this logic somewhere else so that the responsibility of creating the run does not lie with this function. If it stays here, it is a large vector for bugs (for instance, it assumes each evaluation run can have only one queue).

is_live: bool = False # Indicates if the run has live queries
is_active: bool = False # Indicates if the run is currently active
is_closed: bool = False # Indicates if the run is modifiable
is_adhoc: bool = False # Indicates ad-hoc/bucket run behavior
Member Author

The naming is weird; something explicit and simple would be helpful in my opinion, maybe something with the word queue, since here ad-hoc and modifiable are basically the same.


Labels

Backend feature size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants