Conversation

@aantn (Collaborator) commented Jan 7, 2026

Adds eval 195

Summary by CodeRabbit

  • Tests
    • Added test case for validating ConfigMap counting across multiple Kubernetes namespaces with automated setup, execution, and cleanup procedures.


This test creates 49 ConfigMaps in app-183-alpha and 62 in app-183-beta,
then asks Holmes to count them per namespace. It is tagged with toolset-limitation
because there is no dedicated grouping/aggregation tool: the LLM must fetch the raw
lists and count the entries manually, which is error-prone for large datasets.

Signed-off-by: Claude <noreply@anthropic.com>
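For readers unfamiliar with the eval fixture format, a rough sketch of the shape such a test_case.yaml takes, based only on the description above. This is illustrative, not the committed file: the real fixture creates the ConfigMaps in parallel, validates the counts, and was later renumbered and given randomized names (see the later commits in this thread).

# Illustrative sketch only; namespaces and counts follow the PR description above.
user_prompt:
  - "How many ConfigMaps are in namespace app-183-alpha and how many are in app-183-beta?"

expected_output:
  - The answer must state the exact ConfigMap count for app-183-alpha
  - The answer must state the exact ConfigMap count for app-183-beta

tags:
  - kubernetes
  - counting
  - toolset-limitation

before_test: |
  kubectl create namespace app-183-alpha
  kubectl create namespace app-183-beta
  for i in $(seq 1 49); do kubectl create configmap "app-183-cm-$i" -n app-183-alpha; done
  for i in $(seq 1 62); do kubectl create configmap "app-183-cm-$i" -n app-183-beta; done

after_test: |
  kubectl delete namespace app-183-alpha app-183-beta --ignore-not-found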
netlify bot commented Jan 7, 2026

Deploy Preview for holmes-docs ready!

Name Link
🔨 Latest commit ba53c68
🔍 Latest deploy log https://app.netlify.com/projects/holmes-docs/deploys/6961aceee0e8ca0008e39d97
😎 Deploy Preview https://deploy-preview-1337--holmes-docs.netlify.app

linux-foundation-easycla bot commented Jan 7, 2026

CLA Not Signed

github-actions bot (Contributor) commented Jan 7, 2026

📂 Previous Runs

📜 Run @ 64bd141 (#20870620770)

✅ Results of HolmesGPT evals

Automatically triggered by commit 64bd141 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 30.2s ↓11% 5 11 $0.0959
101_loki_historical_logs_pod_deleted 47.3s ↓21% 7 13 $0.1993
111_pod_names_contain_service 43.8s ±0% 8 20 $0.1527
12_job_crashing 51.0s ±0% 9 18 $0.1739
162_get_runbooks 51.4s ±0% 8 17 $0.2318
176_network_policy_blocking_traffic_no_runbooks 38.4s ±0% 6 15 $0.1707
24_misconfigured_pvc 40.2s ±0% 7 17 $0.1276
43_current_datetime_from_prompt 3.3s ±0% 1 $0.0085
61_exact_match_counting 10.6s ±0% 3 3 $0.0326
Total 35.1s avg 6.0 avg 14.2 avg $1.1931

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ 98b6b69 (#20870444675)

✅ Results of HolmesGPT evals

Automatically triggered by commit 98b6b69 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 34.4s ±0% 6 13 $0.1701
101_loki_historical_logs_pod_deleted 42.2s ↓29% 6 11 $0.1738
111_pod_names_contain_service 40.0s ±0% 7 15 $0.1738
12_job_crashing 54.8s ±0% 9 23 $0.2599
162_get_runbooks 47.6s ↓10% 8 15 $0.2278
176_network_policy_blocking_traffic_no_runbooks 35.4s ↓16% 6 14 $0.1707
24_misconfigured_pvc 38.5s ±0% 7 16 $0.1805
43_current_datetime_from_prompt 4.1s ↑15% 1 $0.0618
61_exact_match_counting 10.9s ±0% 3 3 $0.0858
Total 34.2s avg 5.9 avg 13.8 avg $1.5042

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ be77775 (#20870100133)

✅ Results of HolmesGPT evals

Automatically triggered by commit be77775 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 33.1s ±0% 5 12 $0.1525
101_loki_historical_logs_pod_deleted 58.7s ±0% 9 15 $0.2264
111_pod_names_contain_service 50.3s ↑23% 8 17 $0.1949
12_job_crashing 58.1s ±0% 10 18 $0.2176
162_get_runbooks 55.8s ±0% 8 16 $0.2240
176_network_policy_blocking_traffic_no_runbooks 56.0s ↑32% 7 15 $0.1874
24_misconfigured_pvc 45.1s ↑14% 7 18 $0.1816
43_current_datetime_from_prompt 4.0s ↑12% 1 $0.0618
61_exact_match_counting 14.1s ↑22% 3 3 $0.0860
Total 41.7s avg 6.4 avg 14.2 avg $1.5321

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ 9cf4973 (#20870069850)

✅ Results of HolmesGPT evals

Automatically triggered by commit 9cf4973 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 37.7s 6 14 $0.1205
101_loki_historical_logs_pod_deleted 46.1s 6 11 $0.1242
111_pod_names_contain_service 38.0s 6 13 $0.1034
12_job_crashing 60.2s 9 23 $0.1957
162_get_runbooks 44.6s 6 16 $0.1486
176_network_policy_blocking_traffic_no_runbooks 44.4s 7 16 $0.1365
24_misconfigured_pvc 48.1s 8 18 $0.1431
43_current_datetime_from_prompt 4.0s 1 $0.0085
61_exact_match_counting 12.8s 3 3 $0.0326
Total 37.3s avg 5.8 avg 14.2 avg $1.0132

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 11 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📜 Run @ 127b5a1 (#20774730506)

✅ Results of HolmesGPT evals

Automatically triggered by commit 127b5a1 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 31.8s ±0% 6 13 $0.1674
101_loki_historical_logs_pod_deleted 50.6s ±0% 8 17 $0.2190
111_pod_names_contain_service 37.8s ±0% 7 16 $0.1842
12_job_crashing 51.1s ±0% 9 20 $0.2464
162_get_runbooks 53.4s ↑13% 8 17 $0.2440
176_network_policy_blocking_traffic_no_runbooks 38.5s ±0% 7 15 $0.1938
24_misconfigured_pvc 36.8s ±0% 7 16 $0.1765
43_current_datetime_from_prompt 3.0s ±0% 1 $0.0618
61_exact_match_counting 10.8s ±0% 3 3 $0.0870
Total 34.9s avg 6.2 avg 14.6 avg $1.5800

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: Success - 26 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

✅ Results of HolmesGPT evals

Automatically triggered by commit ba53c68 on branch claude/add-counting-tool-PA7Nz

View workflow logs

Results of HolmesGPT evals

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
09_crashpod 30.7s 5 11 $0.0981
101_loki_historical_logs_pod_deleted 43.7s 8 11 $0.1217
111_pod_names_contain_service 53.1s 9 20 $0.1601
12_job_crashing 49.0s 8 18 $0.1596
162_get_runbooks 52.5s 8 16 $0.1588
176_network_policy_blocking_traffic_no_runbooks 49.1s 8 16 $0.1443
24_misconfigured_pvc 39.1s 7 16 $0.1205
43_current_datetime_from_prompt 3.1s 1 $0.0085
61_exact_match_counting 11.8s 3 3 $0.0326
Total 36.9s avg 6.3 avg 13.9 avg $1.0042

Historical comparison unavailable: No historical metrics found (no passing tests with duration data, excluding branch 'claude/add-counting-tool-PA7Nz')

Historical Comparison Details

Filter: excluding branch 'claude/add-counting-tool-PA7Nz'

Status: No historical metrics found (no passing tests with duration data, excluding branch 'claude/add-counting-tool-PA7Nz')

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)
📖 Legend
Icon Meaning
The test was successful
The test was skipped
⚠️ The test failed but is known to be flaky or known to fail
🚧 The test had a setup failure (not a code regression)
🔧 The test failed due to mock data issues (not a code regression)
🚫 The test was throttled by API rate limits/overload
The test failed and should be fixed before merging the PR
🔄 Re-run evals manually

⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect.

To test workflow changes, use the GitHub CLI or Actions UI instead:

gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/add-counting-tool-PA7Nz -f markers=regression -f filter=

Option 1: Comment on this PR with /eval:

/eval
markers: regression

Or with more options (one per line):

/eval
model: gpt-4o
markers: regression
filter: 09_crashpod
iterations: 5

Run evals on a different branch (e.g., master) for comparison:

/eval
branch: master
markers: regression
Option Description
model Model(s) to test (default: same as automatic runs)
markers Pytest markers (no default - runs all tests!)
filter Pytest -k filter (use /list to see valid eval names)
iterations Number of runs, max 10
branch Run evals on a different branch (for cross-branch comparison)

Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.

Option 2: Trigger via GitHub Actions UI → "Run workflow"

🏷️ Valid markers

benchmark, chain-of-causation, compaction, context_window, coralogix, counting, database, datadog, datetime, easy, elasticsearch, embeds, frontend, grafana-dashboard, hard, kafka, kubernetes, leaked-information, logs, loki, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, runbooks, slackbot, storage, toolset-limitation, traces, transparency


Commands: /eval · /rerun · /list

CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/add-counting-tool-PA7Nz -f markers=regression -f filter=

coderabbitai bot (Contributor) commented Jan 7, 2026

Walkthrough

Adds a new test case fixture file for testing ConfigMap counting across Kubernetes namespaces. The test case includes setup scripts to create 49 ConfigMaps in one namespace and 62 in another, validation logic to verify counts, and cleanup scripts to remove test resources.

Changes

Cohort / File(s) Summary
Test Case Fixture
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
New test case YAML fixture defining: user prompt for ConfigMap counts across two namespaces (app-184-alpha and app-184-beta), expected outputs (50 and 63 respectively), before_test script for namespace and ConfigMap creation with parallel execution and count validation, and after_test cleanup script.
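A minimal sketch of the parallel-execution pattern the summary describes (the namespace and ConfigMap names here are placeholders, not the committed script):

kubectl create namespace app-184-alpha
for i in $(seq 1 49); do
  # Launch each creation in the background so the 49 ConfigMaps are created concurrently
  kubectl create configmap "cm-alpha-$i" -n app-184-alpha &
done
wait   # block until every background kubectl job has finished before validating the count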

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • moshemorad
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add eval test for counting ConfigMaps per namespace' directly and clearly summarizes the main change—a new test case YAML fixture that evaluates counting ConfigMaps across namespaces.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.




Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions bot (Contributor) commented Jan 7, 2026

Docker image ready for ae53bca (built in 41s)

⚠️ Warning: does not support ARM (ARM images are built on release only - not on every PR)

Use this tag to pull the image for testing.

📋 Copy commands

⚠️ Temporary images are deleted after 30 days. Copy to a permanent registry before using them:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ae53bca
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ae53bca me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ae53bca
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ae53bca

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:ae53bca

Robusta wrapper chart:

helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:ae53bca

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml:
- Around line 1-52: Add a top-level runbooks field to the YAML test case by
inserting runbooks: {} at the root of the file (alongside user_prompt,
expected_output, tags, before_test, after_test) so the test includes the
required empty runbooks object; ensure it's placed at the top level, not nested
under another key.
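Concretely, the fix the review asks for is a single top-level addition, for example:

tags:
  - kubernetes
  - counting
  - toolset-limitation

runbooks: {}   # empty runbook catalog, declared explicitly as the guideline requires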
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54feab6 and f07dad7.

📒 Files selected for processing (1)
  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
🧰 Additional context used
📓 Path-based instructions (4)
tests/llm/**/*.{py,yaml}

📄 CodeRabbit inference engine (CLAUDE.md)

All pod names must be unique across tests (never reuse pod names between tests)

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
tests/llm/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/**/*.yaml: Never use resource names that hint at the problem or expected behavior in evals (avoid broken-pod, test-project-that-does-not-exist, crashloop-app)
Only use valid tags from pyproject.toml for LLM tests - invalid tags cause test collection failures
Use exit 1 when setup verification fails to fail the test early
Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
Use kubectl exec over port forwarding for setup verification to avoid port conflicts
Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)
Use retry loops for kubectl wait to handle race conditions, don't use bare kubectl wait immediately after resource creation
Use realistic logs in eval tests, not fake/obvious logs like 'Memory usage stabilized at 800MB'
Use realistic filenames in eval tests, not hints like 'disk_consumer.py' - use names like 'training_pipeline.py'
Use real-world scenarios in eval tests (ML pipelines with checkpoint issues, database connection pools) not simulated scenarios

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
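For illustration, a hedged sketch of the retry-loop and early-exit conventions listed above (the pod and namespace names are hypothetical):

# Retry kubectl wait instead of calling it once immediately after resource creation
ready=0
for attempt in $(seq 1 60); do
  if kubectl wait --for=condition=Ready pod/checkout-api -n demo --timeout=1s; then
    ready=1
    break
  fi
  sleep 1
done
if [ "$ready" -ne 1 ]; then
  echo "Setup verification failed: pod never became Ready" >&2
  exit 1   # fail the test early, per the guideline above
fi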
tests/llm/fixtures/test_ask_holmes/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Use sequential test numbers for eval tests, checking existing tests for next available number

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/fixtures/**/*.yaml: Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Never use :latest container tags - use specific versions like grafana/grafana:12.3.1
Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
Match user prompt to test - prompt must explicitly request what you're testing
Don't use technical terms that give away solutions in user prompts - use anti-cheat prompts that prevent domain knowledge shortcuts
Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context
Use include_tool_calls: true to verify tool was called when output values are too generic to rule out hallucinations
Use neutral, application-specific names in eval resources instead of obvious technical terms to prevent domain knowledge cheats
Avoid hint-giving resource names - use realistic business context (checkout-api, user-service, inventory-db) not obvious problem indicators (broken-pod, payment-service-1)
Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
Toolset config in eval tests must go under config field: toolsets.toolset_name.enabled.config

Files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/**/*.yaml : Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Match user prompt to test - prompt must explicitly request what you're testing

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Be specific in expected_output - test exact values like title or unique injected values, not generic patterns

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.208Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.208Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: llm_evals
  • GitHub Check: build
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (4)
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml (4)

18-30: LGTM - Efficient parallel ConfigMap creation.

The parallel creation with background processes and wait is efficient. The verification step (lines 32-46) ensures the correct counts are achieved.


32-48: LGTM - Proper verification with early exit on failure.

The verification correctly checks exact ConfigMap counts and uses exit 1 to fail the test early if setup doesn't match expected state, as per coding guidelines.
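A hedged sketch of that verification pattern (the expected total here is a placeholder; the committed fixture later adjusts it for the default kube-root-ca.crt ConfigMap):

alpha_count=$(kubectl get configmaps -n app-183-alpha --no-headers | wc -l)
if [ "$alpha_count" -ne 49 ]; then   # placeholder expected total for this sketch
  echo "Setup verification failed: expected 49 ConfigMaps in app-183-alpha, found $alpha_count" >&2
  exit 1
fi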


14-16: Namespace names are unique across all tests. ✓


1-11: Test number 183 is properly sequential, and all tags (kubernetes, counting, toolset-limitation) are valid per pyproject.toml.

@aantn (Collaborator, Author) commented Jan 7, 2026

/eval
filter: 183

github-actions bot (Contributor) commented Jan 7, 2026

@aantn Your eval run has finished. ✅ Completed successfully


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 183
Iterations 1
Duration 1m 58s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 7/8 test cases were successful, 0 regressions, 1 setup failure
Status Test case Time Turns Tools Cost
🚧 183_count_configmaps_per_namespace[0]
183a_elasticsearch_cluster_health 11.2s 3 3 $0.0970
183b_elasticsearch_index_discovery 10.6s 3 3 $0.0958
183c_elasticsearch_log_search 19.2s 5 6 $0.1262
183d_elasticsearch_aggregation 21.3s 6 8 $0.1323
183e_elasticsearch_field_mappings 11.9s 3 3 $0.0980
183f_elasticsearch_shard_filtering 13.0s 3 3 $0.1003
183g_elasticsearch_index_stats 13.3s 3 3 $0.1032
Total 14.4s avg 3.7 avg 4.1 avg $0.7528

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'master'

Status: Success - 18 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

- Renamed test from 183 to 184 (183 was already taken)
- Updated namespaces to app-184-alpha and app-184-beta
- Fixed expected counts to 50 and 63 (accounts for the default kube-root-ca.crt ConfigMap added by Kubernetes)

Signed-off-by: Claude <noreply@anthropic.com>
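For context, a quick way to confirm why each namespace total is one higher than the number of ConfigMaps the script creates:

# Kubernetes injects kube-root-ca.crt into every namespace, so 49 created + 1 default = 50
kubectl get configmaps -n app-184-alpha --no-headers | wc -l    # expect 50
kubectl get configmap kube-root-ca.crt -n app-184-alpha         # the auto-created ConfigMap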
@aantn (Collaborator, Author) commented Jan 10, 2026

/eval
filter: 184

github-actions bot (Contributor) commented

@aantn Your eval run has finished. ✅ Completed successfully


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 184
Iterations 1
Duration 2m 44s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 2/2 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
184_count_configmaps_per_namespace[0] 11.3s 3 4 $0.0415
184_elasticsearch_index_explosion 16.8s 4 4 $0.1085
Total 14.0s avg 3.5 avg 4.0 avg $0.1500

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'master'

Status: Success - 9 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

Prevents LLM from guessing count by looking at max sequential number.
Names are now like cm-a1b2c3d4e5f6 instead of config-1, config-2, etc.

Signed-off-by: Claude <noreply@anthropic.com>
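A minimal sketch of how such non-sequential names can be generated in the setup script (the exact generator used in the fixture is not shown in this thread):

# Random 12-hex-character suffixes (cm-a1b2c3d4e5f6 style) prevent inferring the count
# from the highest index embedded in a name.
for i in $(seq 1 49); do
  suffix=$(head -c 6 /dev/urandom | od -An -tx1 | tr -d ' \n')
  kubectl create configmap "cm-${suffix}" -n app-184-alpha &
done
wait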
@aantn (Collaborator, Author) commented Jan 10, 2026

/eval
filter: 184
iterations: 2

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml:
- Around line 1-11: The test case is missing the required runbooks field; update
the YAML for this test (near the top-level keys like user_prompt and
expected_output) to include a runbooks entry (e.g., an empty catalog) so all
eval tests declare runbooks even when none are provided; ensure the new runbooks
field appears at the top level alongside user_prompt, expected_output, and tags.
- Line 1: The test directory and file are named
184_count_configmaps_per_namespace but that test number is already used; rename
the directory and all file references from 184_count_configmaps_per_namespace to
192_count_configmaps_per_namespace (including the YAML "user_prompt" file and
any imports/CI references) so the test number sequence is unique and consistent.
🧹 Nitpick comments (1)
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml (1)

4-6: Consider adding include_tool_calls to verify tool usage.

The expected output checks for exact counts (50 and 63), which are specific. However, since counting ConfigMaps requires tool usage, consider adding include_tool_calls: true to explicitly verify the counting tool was invoked rather than relying solely on the numeric output.

Based on learnings, use include_tool_calls when output values alone might not rule out hallucinations.

🔧 Optional enhancement
 expected_output:
   - The answer must state that app-184-alpha has exactly 50 ConfigMaps
   - The answer must state that app-184-beta has exactly 63 ConfigMaps
+
+include_tool_calls: true
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f07dad7 and 98b6b69.

📒 Files selected for processing (1)
  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
🧰 Additional context used
📓 Path-based instructions (4)
tests/llm/**/*.{py,yaml}

📄 CodeRabbit inference engine (CLAUDE.md)

All pod names must be unique across tests (never reuse pod names between tests)

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/**/*.yaml: Never use resource names that hint at the problem or expected behavior in evals (avoid broken-pod, test-project-that-does-not-exist, crashloop-app)
Only use valid tags from pyproject.toml for LLM tests - invalid tags cause test collection failures
Use exit 1 when setup verification fails to fail the test early
Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
Use kubectl exec over port forwarding for setup verification to avoid port conflicts
Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)
Use retry loops for kubectl wait to handle race conditions, don't use bare kubectl wait immediately after resource creation
Use realistic logs in eval tests, not fake/obvious logs like 'Memory usage stabilized at 800MB'
Use realistic filenames in eval tests, not hints like 'disk_consumer.py' - use names like 'training_pipeline.py'
Use real-world scenarios in eval tests (ML pipelines with checkpoint issues, database connection pools) not simulated scenarios

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/test_ask_holmes/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

Use sequential test numbers for eval tests, checking existing tests for next available number

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/**/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

tests/llm/fixtures/**/*.yaml: Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Never use :latest container tags - use specific versions like grafana/grafana:12.3.1
Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
Match user prompt to test - prompt must explicitly request what you're testing
Don't use technical terms that give away solutions in user prompts - use anti-cheat prompts that prevent domain knowledge shortcuts
Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context
Use include_tool_calls: true to verify tool was called when output values are too generic to rule out hallucinations
Use neutral, application-specific names in eval resources instead of obvious technical terms to prevent domain knowledge cheats
Avoid hint-giving resource names - use realistic business context (checkout-api, user-service, inventory-db) not obvious problem indicators (broken-pod, payment-service-1)
Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
Toolset config in eval tests must go under config field: toolsets.toolset_name.enabled.config

Files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/shared/**/*.yaml : Create shared infrastructure manifest in tests/llm/fixtures/shared/servicename.yaml when multiple tests use the same service
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Be specific in expected_output - test exact values like title or unique injected values, not generic patterns

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Match user prompt to test - prompt must explicitly request what you're testing

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/**/*.yaml : Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/fixtures/shared/**/*.yaml : Create shared infrastructure manifest in tests/llm/fixtures/shared/servicename.yaml when multiple tests use the same service

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
📚 Learning: 2026-01-05T11:14:20.222Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-05T11:14:20.222Z
Learning: Applies to tests/llm/**/*.yaml : Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness

Applied to files:

  • tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.12)
  • GitHub Check: llm_evals
🔇 Additional comments (5)
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml (5)

18-32: LGTM! Parallel ConfigMap creation with anti-cheat measures.

The parallel creation of ConfigMaps using random UUID names effectively prevents counting by predictable sequential patterns, which aligns with the anti-cheat requirements. The explicit comments about the default kube-root-ca.crt ConfigMap add clarity to the expected counts.


34-50: LGTM! Proper setup verification with exit 1.

The verification logic correctly checks the expected counts and fails the test early with exit 1 if the setup doesn't match expectations, following the guideline for setup verification failures.


52-54: LGTM! Proper cleanup.

The cleanup script correctly removes test namespaces with idempotent deletion using || true.
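Roughly, that cleanup amounts to the following (exact flags in the committed script may differ):

kubectl delete namespace app-184-alpha || true   # idempotent: a missing namespace is not an error
kubectl delete namespace app-184-beta || true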


8-11: All tags (kubernetes, counting, toolset-limitation) are valid according to pyproject.toml and will not cause test collection failures.


1-1: No action needed. Infrastructure.yaml is not required for eval tests in this codebase. Analysis of 186 test_case.yaml files in test_ask_holmes shows no infrastructure.yaml files exist anywhere in tests/llm, and test 184 correctly follows the established pattern.

Likely an incorrect or invalid review comment.

Comment on lines 1 to 11
user_prompt:
- "How many ConfigMaps are in namespace app-184-alpha and how many are in app-184-beta? Give me the exact counts."

expected_output:
- The answer must state that app-184-alpha has exactly 50 ConfigMaps
- The answer must state that app-184-beta has exactly 63 ConfigMaps

tags:
- kubernetes
- counting
- toolset-limitation
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🟡 Minor

Add missing runbooks field.

The test_case.yaml is missing the required runbooks field. Even if no custom runbooks are needed, include an empty catalog.

Based on learnings, all eval tests should include a runbooks field in test_case.yaml.

➕ Proposed fix
 tags:
   - kubernetes
   - counting
   - toolset-limitation
+
+runbooks: {}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (the file with runbooks: {} added at the top level):

user_prompt:
- "How many ConfigMaps are in namespace app-184-alpha and how many are in app-184-beta? Give me the exact counts."

expected_output:
- The answer must state that app-184-alpha has exactly 50 ConfigMaps
- The answer must state that app-184-beta has exactly 63 ConfigMaps

tags:
- kubernetes
- counting
- toolset-limitation

runbooks: {}
🤖 Prompt for AI Agents
In @tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
around lines 1 - 11, The test case is missing the required runbooks field;
update the YAML for this test (near the top-level keys like user_prompt and
expected_output) to include a runbooks entry (e.g., an empty catalog) so all
eval tests declare runbooks even when none are provided; ensure the new runbooks
field appears at the top level alongside user_prompt, expected_output, and tags.

@@ -0,0 +1,54 @@
user_prompt:
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# List all test directories in test_ask_holmes and show the last 20 to verify test number sequence

echo "All test directories in test_ask_holmes (last 20):"
fd -t d '^[0-9]+_' tests/llm/fixtures/test_ask_holmes/ | sed 's|.*test_ask_holmes/||' | sed 's|_.*||' | sort -n | tail -20

echo ""
echo "Highest test number found:"
fd -t d '^[0-9]+_' tests/llm/fixtures/test_ask_holmes/ | sed 's|.*test_ask_holmes/||' | sed 's|_.*||' | sort -n | tail -1

Repository: HolmesGPT/holmesgpt

Length of output: 225


🏁 Script executed:

#!/bin/bash
# Check what test directories exist for numbers around the gap
echo "Checking for test 183:"
ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '183'

echo ""
echo "Checking for test 184:"
ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '184'

echo ""
echo "Checking for test 192:"
ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '192'

Repository: HolmesGPT/holmesgpt

Length of output: 891


Rename test directory to 192_count_configmaps_per_namespace; test number 184 is already in use.

Test number 184 is already used by the existing test 184_elasticsearch_index_explosion. The next available sequential test number is 192 (filling the gap after 191). Rename the test directory and file references accordingly.
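A sketch of the rename this implies (note the author ultimately renumbered the test to 195, per the later commit in this thread):

git mv tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace \
       tests/llm/fixtures/test_ask_holmes/192_count_configmaps_per_namespace
# Confirm nothing still points at the old test number
grep -rn "184_count_configmaps_per_namespace" tests/ || true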

🤖 Prompt for AI Agents
In @tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
at line 1, The test directory and file are named
184_count_configmaps_per_namespace but that test number is already used; rename
the directory and all file references from 184_count_configmaps_per_namespace to
192_count_configmaps_per_namespace (including the YAML "user_prompt" file and
any imports/CI references) so the test number sequence is unique and consistent.

github-actions bot (Contributor) commented

@aantn Your eval run has finished. ✅ Completed successfully


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 184
Iterations 2
Duration 2m 44s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 4/4 test cases were successful, 0 regressions
Status Test case Time Turns Tools Cost
184_count_configmaps_per_namespace[0] 10.9s 3 4 $0.0451
184_count_configmaps_per_namespace[0] 11.8s 3 4 $0.0450
184_elasticsearch_index_explosion 20.1s 5 5 $0.1194
184_elasticsearch_index_explosion 25.3s 6 6 $0.1301
Total 17.0s avg 4.2 avg 4.8 avg $0.3396

Historical comparison unavailable: No historical metrics found (no passing tests with duration data, excluding branch 'master')

Historical Comparison Details

Filter: excluding branch 'master'

Status: No historical metrics found (no passing tests with duration data, excluding branch 'master')

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

Updated all namespace references from app-184-* to app-195-*.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Claude <noreply@anthropic.com>
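A hedged sketch of how such a namespace rename might be applied (the actual commands are not shown in the thread; the renumbered path is assumed from the later eval output):

# Rewrite every app-184-* reference inside the renumbered fixture to app-195-*
sed -i 's/app-184-/app-195-/g' \
  tests/llm/fixtures/test_ask_holmes/195_count_configmaps_per_namespace/test_case.yaml
grep -c "app-195-" tests/llm/fixtures/test_ask_holmes/195_count_configmaps_per_namespace/test_case.yaml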
@aantn (Collaborator, Author) commented Jan 10, 2026

/eval
filter: 195
iterations: 2

github-actions bot (Contributor) commented

@aantn Your eval run has finished. ⚠️ Completed with 2 failures


🧪 Manual Eval Results

Parameter Value
Triggered via /eval comment
Branch claude/add-counting-tool-PA7Nz
Model bedrock/eu.anthropic.claude-sonnet-4-5-20250929-v1:0
Markers all LLM tests
Filter (-k) 195
Iterations 2
Duration 1m 42s
Workflow View logs | Rerun

Results of HolmesGPT evals

  • ask_holmes: 0/2 test cases were successful, 2 regressions
Status Test case Time Turns Tools Cost
195_count_configmaps_per_namespace[0] 17.9s 4 6 $0.1450
195_count_configmaps_per_namespace[0] 17.6s 4 6 $0.1496
Total 17.7s avg 4.0 avg 6.0 avg $0.2946

Time/Cost columns show % change vs historical average (↑slower/costlier, ↓faster/cheaper). Changes under 10% shown as ±0%.

Historical Comparison Details

Filter: excluding branch 'master'

Status: Success - 9 test/model combinations loaded

Experiments compared (30):

Comparison indicators:

  • ±0% — diff under 10% (within noise threshold)
  • ↑N%/↓N% — diff 10-25%
  • ↑N%/↓N% — diff over 25% (significant)

⚠️ 2 Failures Detected

