Conversation

@aantn (Collaborator) commented Jan 7, 2026

Adds eval 195

Summary by CodeRabbit

  • Tests
    • Added test case for validating ConfigMap counting across multiple Kubernetes namespaces with automated setup, execution, and cleanup procedures.


This test creates 49 ConfigMaps in app-183-alpha and 62 in app-183-beta,
then asks Holmes to count them per namespace. It is tagged with toolset-limitation
because there is no dedicated grouping/aggregation tool: the LLM must fetch the raw
lists and count the entries manually, which is error-prone for large datasets.

Signed-off-by: Claude <noreply@anthropic.com>
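For readers unfamiliar with the eval fixture format, a rough sketch of the shape such a test_case.yaml takes, based only on the description above. This is illustrative, not the committed file: the real fixture creates the ConfigMaps in parallel, validates the counts, and was later renumbered and given randomized names (see the later commits in this thread).

# Illustrative sketch only; namespaces and counts follow the PR description above.
user_prompt:
  - "How many ConfigMaps are in namespace app-183-alpha and how many are in app-183-beta?"

expected_output:
  - The answer must state the exact ConfigMap count for app-183-alpha
  - The answer must state the exact ConfigMap count for app-183-beta

tags:
  - kubernetes
  - counting
  - toolset-limitation

before_test: |
  kubectl create namespace app-183-alpha
  kubectl create namespace app-183-beta
  for i in $(seq 1 49); do kubectl create configmap "app-183-cm-$i" -n app-183-alpha; done
  for i in $(seq 1 62); do kubectl create configmap "app-183-cm-$i" -n app-183-beta; done

after_test: |
  kubectl delete namespace app-183-alpha app-183-beta --ignore-not-found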
netlify bot commented Jan 7, 2026

Deploy Preview for holmes-docs ready!

Name Link
🔨 Latest commit ba53c68
🔍 Latest deploy log https://app.netlify.com/projects/holmes-docs/deploys/6961aceee0e8ca0008e39d97
😎 Deploy Preview https://deploy-preview-1337--holmes-docs.netlify.app

linux-foundation-easycla bot commented Jan 7, 2026

CLA Not Signed

github-actions bot (Contributor) commented Jan 7, 2026

📂 Previous Runs

📜 Run @ 64bd141 (#20870620770)

✅ Results of HolmesGPT evals

Automatically triggered by commit 64bd141 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 30.2s ↓11% 5 11 $0.0959
101_loki_historical_logs_pod_deleted 47.3s ↓21% 7 13 $0.1993
111_pod_names_contain_service 43.8s ±0% 8 20 $0.1527
12_job_crashing 51.0s ±0% 9 18 $0.1739
162_get_runbooks 51.4s ±0% 8 17 $0.2318
176_network_policy_blocking_traffic_no_runbooks 38.4s ±0% 6 15 $0.1707
24_misconfigured_pvc 40.2s ±0% 7 17 $0.1276
43_current_datetime_from_prompt 3.3s ±0% 1 $0.0085
61_exact_match_counting 10.6s ±0% 3 3 $0.0326
Total 35.1s avg 6.0 avg 14.2 avg $1.1931

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ 98b6b69 (#20870444675)

✅ Results of HolmesGPT evals

Automatically triggered by commit 98b6b69 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 34.4s ±0% 6 13 $0.1701
101_loki_historical_logs_pod_deleted 42.2s ↓29% 6 11 $0.1738
111_pod_names_contain_service 40.0s ±0% 7 15 $0.1738
12_job_crashing 54.8s ±0% 9 23 $0.2599
162_get_runbooks 47.6s ↓10% 8 15 $0.2278
176_network_policy_blocking_traffic_no_runbooks 35.4s ↓16% 6 14 $0.1707
24_misconfigured_pvc 38.5s ±0% 7 16 $0.1805
43_current_datetime_from_prompt 4.1s ↑15% 1 $0.0618
61_exact_match_counting 10.9s ±0% 3 3 $0.0858
Total 34.2s avg 5.9 avg 13.8 avg $1.5042

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ be77775 (#20870100133)

✅ Results of HolmesGPT evals

Automatically triggered by commit be77775 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 33.1s ±0% 5 12 $0.1525
101_loki_historical_logs_pod_deleted 58.7s ±0% 9 15 $0.2264
111_pod_names_contain_service 50.3s ↑23% 8 17 $0.1949
12_job_crashing 58.1s ±0% 10 18 $0.2176
162_get_runbooks 55.8s ±0% 8 16 $0.2240
176_network_policy_blocking_traffic_no_runbooks 56.0s ↑32% 7 15 $0.1874
24_misconfigured_pvc 45.1s ↑14% 7 18 $0.1816
43_current_datetime_from_prompt 4.0s ↑12% 1 $0.0618
61_exact_match_counting 14.1s ↑22% 3 3 $0.0860
Total 41.7s avg 6.4 avg 14.2 avg $1.5321

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ 9cf4973 (#20870069850)

✅ Results of HolmesGPT evals

Automatically triggered by commit 9cf4973 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 37.7s 6 14 $0.1205
101_loki_historical_logs_pod_deleted 46.1s 6 11 $0.1242
111_pod_names_contain_service 38.0s 6 13 $0.1034
12_job_crashing 60.2s 9 23 $0.1957
162_get_runbooks 44.6s 6 16 $0.1486
176_network_policy_blocking_traffic_no_runbooks 44.4s 7 16 $0.1365
24_misconfigured_pvc 48.1s 8 18 $0.1431
43_current_datetime_from_prompt 4.0s 1 $0.0085
61_exact_match_counting 12.8s 3 3 $0.0326
Total 37.3s avg 5.8 avg 14.2 avg $1.0132

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ 127b5a1 (#20774730506)

✅ Results of HolmesGPT evals

Automatically triggered by commit 127b5a1 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 31.8s ±0% 6 13 $0.1674
101_loki_historical_logs_pod_deleted 50.6s ±0% 8 17 $0.2190
111_pod_names_contain_service 37.8s ±0% 7 16 $0.1842
12_job_crashing 51.1s ±0% 9 20 $0.2464
162_get_runbooks 53.4s ↑13% 8 17 $0.2440
176_network_policy_blocking_traffic_no_runbooks 38.5s ±0% 7 15 $0.1938
24_misconfigured_pvc 36.8s ±0% 7 16 $0.1765
43_current_datetime_from_prompt 3.0s ±0% 1 $0.0618
61_exact_match_counting 10.8s ±0% 3 3 $0.0870
Total 34.9s avg 6.2 avg 14.6 avg $1.5800

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 26 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

✅ Results of HolmesGPT evals

Automatically triggered by commit ba53c68 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 30.7s 5 11 $0.0981
101_loki_historical_logs_pod_deleted 43.7s 8 11 $0.1217
111_pod_names_contain_service 53.1s 9 20 $0.1601
12_job_crashing 49.0s 8 18 $0.1596
162_get_runbooks 52.5s 8 16 $0.1588
176_network_policy_blocking_traffic_no_runbooks 49.1s 8 16 $0.1443
24_misconfigured_pvc 39.1s 7 16 $0.1205
43_current_datetime_from_prompt 3.1s 1 $0.0085
61_exact_match_counting 11.8s 3 3 $0.0326
Total 36.9s avg 6.3 avg 13.9 avg $1.0042

Historical comparison unavailable: No historical metrics found (no passing tests with duration data, excluding branch 'claude/add-counting-tool-PA7Nz')

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: No historical metrics found (no passing tests with duration data, excluding branch 'claude/add-counting-tool-PA7Nz')

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📖 Legend
Icon Meaning
The test was successful
The test was skipped
⚠️ The test failed but is known to be flaky or known to fail
🚧 The test had a setup failure (not a code regression)
🔧 The test failed due to mock data issues (not a code regression)
🚫 The test was throttled by API rate limits/overload
The test failed and should be fixed before merging the PR
🔄 Re-run evals manually

⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect.

To test workflow changes, use the GitHub CLI or Actions UI instead:

gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/add-counting-tool-PA7Nz -f markers=regression -f filter=

Option 1: Comment on this PR with /eval:

/eval
markers: regression

Or with more options (one per line):

/eval
model: gpt-4o
markers: regression
filter: 09_crashpod
iterations: 5

Run evals on a different branch (e.g., master) for comparison:

/eval
branch: master
markers: regression
Option Description
model Model(s) to test (default: same as automatic runs)
markers Pytest markers (no default - runs all tests!)
filter Pytest -k filter (use /list to see valid eval names)
iterations Number of runs, max 10
branch Run evals on a different branch (for cross-branch comparison)

Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.

Option 2: Trigger via GitHub Actions UI → "Run workflow"

🏷️ Valid markers

benchmark, chain-of-causation, compaction, context_window, coralogix, counting, database, datadog, datetime, easy, elasticsearch, embeds, frontend, grafana-dashboard, hard, kafka, kubernetes, leaked-information, logs, loki, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, runbooks, slackbot, storage, toolset-limitation, traces, transparency


Commands: /eval · /rerun · /list

CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/add-counting-tool-PA7Nz -f markers=regression -f filter=

coderabbitai bot (Contributor) commented Jan 7, 2026

Walkthrough

Adds a new test case fixture file for testing ConfigMap counting across Kubernetes namespaces. The test case includes setup scripts to create 49 ConfigMaps in one namespace and 62 in another, validation logic to verify counts, and cleanup scripts to remove test resources.

Changes

Cohort / File(s) Summary
Test Case Fixture
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
New test case YAML fixture defining: user prompt for ConfigMap counts across two namespaces (app-184-alpha and app-184-beta), expected outputs (50 and 63 respectively), before_test script for namespace and ConfigMap creation with parallel execution and count validation, and after_test cleanup script.
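A minimal sketch of the parallel-execution pattern the summary describes (the namespace and ConfigMap names here are placeholders, not the committed script):

kubectl create namespace app-184-alpha
for i in $(seq 1 49); do
  # Launch each creation in the background so the 49 ConfigMaps are created concurrently
  kubectl create configmap "cm-alpha-$i" -n app-184-alpha &
done
wait   # block until every background kubectl job has finished before validating the count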

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • moshemorad
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add eval test for counting ConfigMaps per namespace' directly and clearly summarizes the main change—a new test case YAML fixture that evaluates counting ConfigMaps across namespaces.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.




Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions bot (Contributor) commented Jan 7, 2026

Docker image ready for ae53bca (built in 41s)

⚠️ Warning: does not support ARM (ARM images are built on release only - not on every PR)

Use this tag to pull the image for testing.

📋 Copy commands

⚠️ Temporary images are deleted after 30 days. Copy to a permanent registry before using them:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ae53bca
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ae53bca me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ae53bca
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ae53bca

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:ae53bca

Robusta wrapper chart:

helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:ae53bca

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml:
- Around line 1-52: Add a top-level runbooks field to the YAML test case by
inserting runbooks: {} at the root of the file (alongside user_prompt,
expected_output, tags, before_test, after_test) so the test includes the
required empty runbooks object; ensure it's placed at the top level, not nested
under another key.
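Concretely, the fix the review asks for is a single top-level addition, for example:

tags:
  - kubernetes
  - counting
  - toolset-limitation

runbooks: {}   # empty runbook catalog, declared explicitly as the guideline requires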
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54feab6 and f07dad7.

📒 Files selected for processing (1)
  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
🧰 Additional context used
📓 Path-based instructions (4)
tests/llm/**/*.{py,yaml}

📄 CodeRabbit inference engine (CLAUDE.md)

All pod names must be unique across tests (never reuse pod names between tests)

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
tests/llm/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/**/*.yaml: Never use resource names that hint at the problem or expected behavior in evals (avoid broken-pod, test-project-that-does-not-exist, crashloop-app)
Only use valid tags from pyproject.toml for LLM tests - invalid tags cause test collection failures
Use exit 1 when setup verification fails to fail the test early
Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
Use kubectl exec over port forwarding for setup verification to avoid port conflicts
Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)
Use retry loops for kubectl wait to handle race conditions, don't use bare kubectl wait immediately after resource creation
Use realistic logs in eval tests, not fake/obvious logs like 'Memory usage stabilized at 800MB'
Use realistic filenames in eval tests, not hints like 'disk_consumer.py' - use names like 'training_pipeline.py'
Use real-world scenarios in eval tests (ML pipelines with checkpoint issues, database connection pools) not simulated scenarios

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
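For illustration, a hedged sketch of the retry-loop and early-exit conventions listed above (the pod and namespace names are hypothetical):

# Retry kubectl wait instead of calling it once immediately after resource creation
ready=0
for attempt in $(seq 1 60); do
  if kubectl wait --for=condition=Ready pod/checkout-api -n demo --timeout=1s; then
    ready=1
    break
  fi
  sleep 1
done
if [ "$ready" -ne 1 ]; then
  echo "Setup verification failed: pod never became Ready" >&2
  exit 1   # fail the test early, per the guideline above
fi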
tests/llm/fixtures/test_ask_holmes/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Use sequential test numbers for eval tests, checking existing tests for next available number

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/fixtures/**/*.yaml: Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Never use :latest container tags - use specific versions like grafana/grafana:12.3.1
Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
Match user prompt to test - prompt must explicitly request what you're testing
Don't use technical terms that give away solutions in user prompts - use anti-cheat prompts that prevent domain knowledge shortcuts
Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context
Use include_tool_calls: true to verify tool was called when output values are too generic to rule out hallucinations
Use neutral, application-specific names in eval resources instead of obvious technical terms to prevent domain knowledge cheats
Avoid hint-giving resource names - use realistic business context (checkout-api, user-service, inventory-db) not obvious problem indicators (broken-pod, payment-service-1)
Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
Toolset config in eval tests must go under config field: toolsets.toolset_name.enabled.config

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/**/*.yaml : Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Match user prompt to test - prompt must explicitly request what you're testing

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Be specific in expected_output - test exact values like title or unique injected values, not generic patterns

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: llm_evals
  • GitHub Check: build
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (4)
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml (4)

18-30: LGTM - Efficient parallel ConfigMap creation.

The parallel creation with background processes and wait is efficient. The verification step (lines 32-46) ensures the correct counts are achieved.


32-48: LGTM - Proper verification with early exit on failure.

The verification correctly checks exact ConfigMap counts and uses exit 1 to fail the test early if setup doesn't match expected state, as per coding guidelines.
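A hedged sketch of that verification pattern (the expected total here is a placeholder; the committed fixture later adjusts it for the default kube-root-ca.crt ConfigMap):

alpha_count=$(kubectl get configmaps -n app-183-alpha --no-headers | wc -l)
if [ "$alpha_count" -ne 49 ]; then   # placeholder expected total for this sketch
  echo "Setup verification failed: expected 49 ConfigMaps in app-183-alpha, found $alpha_count" >&2
  exit 1
fi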


14-16: Namespace names are unique across all tests. ✓


1-11: Test number 183 is properly sequential, and all tags (kubernetes, counting, toolset-limitation) are valid per pyproject.toml.

@aantn (Collaborator, Author) commented Jan 7, 2026

/eval
filter: 183

github-actions bot (Contributor) commented Jan 7, 2026

@aantn Your eval run has finished. ✅ Completed successfully


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 183
Iterations 1
Duration 1m 58s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 7/8 test cases were successful, 0 regressions, 1 setup failure
Status Test case Time Turns Tools Cost
🚧 183_count_configmaps_per_namespace[0]
183a_elasticsearch_cluster_health 11.2s 3 3 $0.0970
183b_elasticsearch_index_discovery 10.6s 3 3 $0.0958
183c_elasticsearch_log_search 19.2s 5 6 $0.1262
183d_elasticsearch_aggregation 21.3s 6 8 $0.1323
183e_elasticsearch_field_mappings 11.9s 3 3 $0.0980
183f_elasticsearch_shard_filtering 13.0s 3 3 $0.1003
183g_elasticsearch_index_stats 13.3s 3 3 $0.1032
Total 14.4s avg 3.7 avg 4.1 avg $0.7528

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'master'

Status: Success - 18 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

- Renamed test from 183 to 184 (183 was already taken)
- Updated namespaces to app-184-alpha and app-184-beta
- Fixed expected counts to 50 and 63 (accounts for the default kube-root-ca.crt ConfigMap added by Kubernetes)

Signed-off-by: Claude <noreply@anthropic.com>
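For context, a quick way to confirm why each namespace total is one higher than the number of ConfigMaps the script creates:

# Kubernetes injects kube-root-ca.crt into every namespace, so 49 created + 1 default = 50
kubectl get configmaps -n app-184-alpha --no-headers | wc -l    # expect 50
kubectl get configmap kube-root-ca.crt -n app-184-alpha         # the auto-created ConfigMap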
@aantn (Collaborator, Author) commented Jan 10, 2026

/eval
filter: 184

github-actions bot (Contributor) commented

@aantn Your eval run has finished. ✅ Completed successfully


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 184
Iterations 1
Duration 2m 44s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 2/2 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
184_count_configmaps_per_namespace[0] 11.3s 3 4 $0.0415
184_elasticsearch_index_explosion 16.8s 4 4 $0.1085
Total 14.0s avg 3.5 avg 4.0 avg $0.1500

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'master'

Status: Success - 9 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

Prevents LLM from guessing count by looking at max sequential number.
Names are now like cm-a1b2c3d4e5f6 instead of config-1, config-2, etc.

Signed-off-by: Claude <noreply@anthropic.com>
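A minimal sketch of how such non-sequential names can be generated in the setup script (the exact generator used in the fixture is not shown in this thread):

# Random 12-hex-character suffixes (cm-a1b2c3d4e5f6 style) prevent inferring the count
# from the highest index embedded in a name.
for i in $(seq 1 49); do
  suffix=$(head -c 6 /dev/urandom | od -An -tx1 | tr -d ' \n')
  kubectl create configmap "cm-${suffix}" -n app-184-alpha &
done
wait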
@aantn (Collaborator, Author) commented Jan 10, 2026

/eval
filter: 184
iterations: 2

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml:
- Around line 1-11: The test case is missing the required runbooks field; update
the YAML for this test (near the top-level keys like user_prompt and
expected_output) to include a runbooks entry (e.g., an empty catalog) so all
eval tests declare runbooks even when none are provided; ensure the new runbooks
field appears at the top level alongside user_prompt, expected_output, and tags.
- Line 1: The test directory and file are named
184_count_configmaps_per_namespace but that test number is already used; rename
the directory and all file references from 184_count_configmaps_per_namespace to
192_count_configmaps_per_namespace (including the YAML "user_prompt" file and
any imports/CI references) so the test number sequence is unique and consistent.
🧹 Nitpick comments (1)
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml (1)

4-6: Consider adding include_tool_calls to verify tool usage.

The expected output checks for exact counts (50 and 63), which are specific. However, since counting ConfigMaps requires tool usage, consider adding include_tool_calls: true to explicitly verify the counting tool was invoked rather than relying solely on the numeric output.

Based on learnings, use include_tool_calls when output values alone might not rule out hallucinations.

🔧 Optional enhancement
 expected_output:
   - The answer must state that app-184-alpha has exactly 50 ConfigMaps
   - The answer must state that app-184-beta has exactly 63 ConfigMaps
+
+include_tool_calls: true
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f07dad7 and 98b6b69.

📒 Files selected for processing (1)
  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
🧰 Additional context used
📓 Path-based instructions (4)
tests/llm/**/*.{py,yaml}

📄 CodeRabbit inference engine (CLAUDE.md)

All pod names must be unique across tests (never reuse pod names between tests)

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/**/*.yaml: Never use resource names that hint at the problem or expected behavior in evals (avoid broken-pod, test-project-that-does-not-exist, crashloop-app)
Only use valid tags from pyproject.toml for LLM tests - invalid tags cause test collection failures
Use exit 1 when setup verification fails to fail the test early
Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
Use kubectl exec over port forwarding for setup verification to avoid port conflicts
Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)
Use retry loops for kubectl wait to handle race conditions, don't use bare kubectl wait immediately after resource creation
Use realistic logs in eval tests, not fake/obvious logs like 'Memory usage stabilized at 800MB'
Use realistic filenames in eval tests, not hints like 'disk_consumer.py' - use names like 'training_pipeline.py'
Use real-world scenarios in eval tests (ML pipelines with checkpoint issues, database connection pools) not simulated scenarios

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/test_ask_holmes/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Use sequential test numbers for eval tests, checking existing tests for next available number

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/fixtures/**/*.yaml: Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Never use :latest container tags - use specific versions like grafana/grafana:12.3.1
Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
Match user prompt to test - prompt must explicitly request what you're testing
Don't use technical terms that give away solutions in user prompts - use anti-cheat prompts that prevent domain knowledge shortcuts
Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context
Use include_tool_calls: true to verify tool was called when output values are too generic to rule out hallucinations
Use neutral, application-specific names in eval resources instead of obvious technical terms to prevent domain knowledge cheats
Avoid hint-giving resource names - use realistic business context (checkout-api, user-service, inventory-db) not obvious problem indicators (broken-pod, payment-service-1)
Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
Toolset config in eval tests must go under config field: toolsets.toolset_name.enabled.config

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/shared/**/*.yaml : Create shared infrastructure manifest in tests/llm/fixtures/shared/servicename.yaml when multiple tests use the same service
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Be specific in expected_output - test exact values like title or unique injected values, not generic patterns

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Match user prompt to test - prompt must explicitly request what you're testing

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/**/*.yaml : Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/shared/**/*.yaml : Create shared infrastructure manifest in tests/llm/fixtures/shared/servicename.yaml when multiple tests use the same service

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/**/*.yaml : Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.12)
  • GitHub Check: llm_evals
🔇 Additional comments (5)
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml (5)

18-32: LGTM! Parallel ConfigMap creation with anti-cheat measures.

The parallel creation of ConfigMaps using random UUID names effectively prevents counting by predictable sequential patterns, which aligns with the anti-cheat requirements. The explicit comments about the default kube-root-ca.crt ConfigMap add clarity to the expected counts.


34-50: LGTM! Proper setup verification with exit 1.

The verification logic correctly checks the expected counts and fails the test early with exit 1 if the setup doesn't match expectations, following the guideline for setup verification failures.


52-54: LGTM! Proper cleanup.

The cleanup script correctly removes test namespaces with idempotent deletion using || true.
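Roughly, that cleanup amounts to the following (exact flags in the committed script may differ):

kubectl delete namespace app-184-alpha || true   # idempotent: a missing namespace is not an error
kubectl delete namespace app-184-beta || true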


8-11: All tags (kubernetes, counting, toolset-limitation) are valid according to pyproject.toml and will not cause test collection failures.


1-1: No action needed. Infrastructure.yaml is not required for eval tests in this codebase. Analysis of 186 test_case.yaml files in test_ask_holmes shows no infrastructure.yaml files exist anywhere in tests/llm, and test 184 correctly follows the established pattern.

Likely an incorrect or invalid review comment.

Comment on lines 1 to 11
user_prompt:
- "How many ConfigMaps are in namespace app-184-alpha and how many are in app-184-beta? Give me the exact counts."

expected_output:
- The answer must state that app-184-alpha has exactly 50 ConfigMaps
- The answer must state that app-184-beta has exactly 63 ConfigMaps

tags:
- kubernetes
- counting
- toolset-limitation
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🟡 Minor

Add missing runbooks field.

The test_case.yaml is missing the required runbooks field. Even if no custom runbooks are needed, include an empty catalog.

Based on learnings, all eval tests should include a runbooks field in test_case.yaml.

➕ Proposed fix
 tags:
   - kubernetes
   - counting
   - toolset-limitation
+
+runbooks: {}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the file with runbooks: {} added at the top level):

user_prompt:
- "How many ConfigMaps are in namespace app-184-alpha and how many are in app-184-beta? Give me the exact counts."

expected_output:
- The answer must state that app-184-alpha has exactly 50 ConfigMaps
- The answer must state that app-184-beta has exactly 63 ConfigMaps

tags:
- kubernetes
- counting
- toolset-limitation

runbooks: {}
🤖 Prompt for AI Agents
In @tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
around lines 1 - 11, The test case is missing the required runbooks field;
update the YAML for this test (near the top-level keys like user_prompt and
expected_output) to include a runbooks entry (e.g., an empty catalog) so all
eval tests declare runbooks even when none are provided; ensure the new runbooks
field appears at the top level alongside user_prompt, expected_output, and tags.

@@ -0,0 +1,54 @@
user_prompt:
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# List all test directories in test_ask_holmes and show the last 20 to verify test number sequence

echo "All test directories in test_ask_holmes (last 20):"
fd -t d '^[0-9]+_' tests/llm/fixtures/test_ask_holmes/ | sed 's|.*test_ask_holmes/||' | sed 's|_.*||' | sort -n | tail -20

echo ""
echo "Highest test number found:"
fd -t d '^[0-9]+_' tests/llm/fixtures/test_ask_holmes/ | sed 's|.*test_ask_holmes/||' | sed 's|_.*||' | sort -n | tail -1

Repository: HolmesGPT/holmesgpt

Length of output: 225


🏁 Script executed:

#!/bin/bash
# Check what test directories exist for numbers around the gap
echo "Checking for test 183:"
ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '183'

echo ""
echo "Checking for test 184:"
ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '184'

echo ""
echo "Checking for test 192:"
ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '192'

Repository: HolmesGPT/holmesgpt

Length of output: 891


Rename test directory to 192_count_configmaps_per_namespace; test number 184 is already in use.

Test number 184 is already used by the existing test 184_elasticsearch_index_explosion. The next available sequential test number is 192 (filling the gap after 191). Rename the test directory and file references accordingly.
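A sketch of the rename this implies (note the author ultimately renumbered the test to 195, per the later commit in this thread):

git mv tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace \
       tests/llm/fixtures/test_ask_holmes/192_count_configmaps_per_namespace
# Confirm nothing still points at the old test number
grep -rn "184_count_configmaps_per_namespace" tests/ || true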

🤖 Prompt for AI Agents
In @tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
at line 1, The test directory and file are named
184_count_configmaps_per_namespace but that test number is already used; rename
the directory and all file references from 184_count_configmaps_per_namespace to
192_count_configmaps_per_namespace (including the YAML "user_prompt" file and
any imports/CI references) so the test number sequence is unique and consistent.

github-actions bot (Contributor) commented

@aantn Your eval run has finished. ✅ Completed successfully


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 184
Iterations 2
Duration 2m 44s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 4/4 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
184_count_configmaps_per_namespace[0] 10.9s 3 4 $0.0451
184_count_configmaps_per_namespace[0] 11.8s 3 4 $0.0450
184_elasticsearch_index_explosion 20.1s 5 5 $0.1194
184_elasticsearch_index_explosion 25.3s 6 6 $0.1301
Total 17.0s avg 4.2 avg 4.8 avg $0.3396

Historical comparison unavailable: No historical metrics found (no passing tests with duration data, excluding branch 'master')

Historical Comparison Details

Filter: excluding branch 'master'

Status: No historical metrics found (no passing tests with duration data, excluding branch 'master')

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

Updated all namespace references from app-184-* to app-195-*.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Claude <noreply@anthropic.com>
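A hedged sketch of how such a namespace rename might be applied (the actual commands are not shown in the thread; the renumbered path is assumed from the later eval output):

# Rewrite every app-184-* reference inside the renumbered fixture to app-195-*
sed -i 's/app-184-/app-195-/g' \
  tests/llm/fixtures/test_ask_holmes/195_count_configmaps_per_namespace/test_case.yaml
grep -c "app-195-" tests/llm/fixtures/test_ask_holmes/195_count_configmaps_per_namespace/test_case.yaml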
@aantn (Collaborator, Author) commented Jan 10, 2026

/eval
filter: 195
iterations: 2

github-actions bot (Contributor) commented

@aantn Your eval run has finished. ⚠️ Completed with 2 failures


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 195
Iterations 2
Duration 1m 42s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 0/2 test cases were successful, 2 regressions
Status Test case Time Turns Tools Cost
195_count_configmaps_per_namespace[0] 17.9s 4 6 $0.1450
195_count_configmaps_per_namespace[0] 17.6s 4 6 $0.1496
Total 17.7s avg 4.0 avg 6.0 avg $0.2946

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'master'

Status: Success - 9 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

⚠️ 2 Failures Detected

