Add eval test for counting ConfigMaps per namespace #1337
base: master
Conversation
This test creates 49 ConfigMaps in app-183-alpha and 62 in app-183-beta, then asks Holmes to count them per namespace. Tagged with toolset-limitation since there's no proper grouping/aggregation tool - the LLM must fetch raw lists and count manually, which is error-prone for large datasets. Signed-off-by: Claude <noreply@anthropic.com>
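A rough sketch of the kind of setup the description implies is shown below; the namespace names come from the description, but the ConfigMap naming and loop structure are illustrative assumptions rather than the fixture's actual before_test script.

```bash
# Hypothetical sketch of the setup described above (not the actual fixture script).
kubectl create namespace app-183-alpha
kubectl create namespace app-183-beta

# Create 49 ConfigMaps in the first namespace and 62 in the second.
for i in $(seq 1 49); do
  kubectl create configmap "cm-$i" -n app-183-alpha
done
for i in $(seq 1 62); do
  kubectl create configmap "cm-$i" -n app-183-beta
done
```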
✅ Deploy Preview for holmes-docs ready!
📂 Previous Runs

📜 Run @ 64bd141 (#20870620770): ✅ Results of HolmesGPT evals, automatically triggered by commit 64bd141. Time/Cost columns show % change vs the historical average (↑ slower/costlier, ↓ faster/cheaper); changes under 10% are shown as ±0%. Historical comparison filter: excluding branch 'claude/add-counting-tool-PA7Nz'. Status: Success, 11 test/model combinations loaded. Experiments compared: 30.

📜 Run @ 98b6b69 (#20870444675): ✅ Results of HolmesGPT evals, automatically triggered by commit 98b6b69. Same comparison settings. Status: Success, 11 test/model combinations loaded. Experiments compared: 30.

📜 Run @ be77775 (#20870100133): ✅ Results of HolmesGPT evals, automatically triggered by commit be77775. Same comparison settings. Status: Success, 11 test/model combinations loaded. Experiments compared: 30.

📜 Run @ 9cf4973 (#20870069850): ✅ Results of HolmesGPT evals, automatically triggered by commit 9cf4973. Same comparison settings. Status: Success, 11 test/model combinations loaded. Experiments compared: 30.

📜 Run @ 127b5a1 (#20774730506): ✅ Results of HolmesGPT evals, automatically triggered by commit 127b5a1. Same comparison settings. Status: Success, 26 test/model combinations loaded. Experiments compared: 30.

✅ Results of HolmesGPT evals, automatically triggered by commit ba53c68. Historical comparison unavailable: no historical metrics found (no passing tests with duration data, excluding branch 'claude/add-counting-tool-PA7Nz'). Experiments compared: 30.
Walkthrough

Adds a new test case fixture file for testing ConfigMap counting across Kubernetes namespaces. The test case includes setup scripts to create 49 ConfigMaps in one namespace and 62 in another, validation logic to verify counts, and cleanup scripts to remove test resources.

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✅ Docker image ready.

Use this tag to pull the image for testing.

📋 Copy commands:

    gcloud auth configure-docker us-central1-docker.pkg.dev
    docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ae53bca
    docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:ae53bca me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ae53bca
    docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:ae53bca

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

    helm upgrade --install holmesgpt ./helm/holmes \
      --set registry=me-west1-docker.pkg.dev/robusta-development/development \
      --set image=holmes-dev:ae53bca

Robusta wrapper chart:

    helm upgrade --install robusta robusta/robusta \
      --reuse-values \
      --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
      --set holmes.image=holmes-dev:ae53bca
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
@tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml:
- Around line 1-52: Add a top-level runbooks field to the YAML test case by
inserting runbooks: {} at the root of the file (alongside user_prompt,
expected_output, tags, before_test, after_test) so the test includes the
required empty runbooks object; ensure it's placed at the top level, not nested
under another key.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
🧰 Additional context used
📓 Path-based instructions (4)
tests/llm/**/*.{py,yaml}
📄 CodeRabbit inference engine (CLAUDE.md)
All pod names must be unique across tests (never reuse pod names between tests)
Files:
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
tests/llm/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/**/*.yaml: Never use resource names that hint at the problem or expected behavior in evals (avoid broken-pod, test-project-that-does-not-exist, crashloop-app)
Only use valid tags from pyproject.toml for LLM tests - invalid tags cause test collection failures
Use exit 1 when setup verification fails to fail the test early
Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
Use kubectl exec over port forwarding for setup verification to avoid port conflicts
Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)
Use retry loops for kubectl wait to handle race conditions, don't use bare kubectl wait immediately after resource creation
Use realistic logs in eval tests, not fake/obvious logs like 'Memory usage stabilized at 800MB'
Use realistic filenames in eval tests, not hints like 'disk_consumer.py' - use names like 'training_pipeline.py'
Use real-world scenarios in eval tests (ML pipelines with checkpoint issues, database connection pools) not simulated scenarios
Files:
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/test_ask_holmes/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
Use sequential test numbers for eval tests, checking existing tests for next available number
Files:
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/fixtures/**/*.yaml: Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Never use :latest container tags - use specific versions like grafana/grafana:12.3.1
Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
Match user prompt to test - prompt must explicitly request what you're testing
Don't use technical terms that give away solutions in user prompts - use anti-cheat prompts that prevent domain knowledge shortcuts
Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context
Use include_tool_calls: true to verify tool was called when output values are too generic to rule out hallucinations
Use neutral, application-specific names in eval resources instead of obvious technical terms to prevent domain knowledge cheats
Avoid hint-giving resource names - use realistic business context (checkout-api, user-service, inventory-db) not obvious problem indicators (broken-pod, payment-service-1)
Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
Toolset config in eval tests must go under config field: toolsets.toolset_name.enabled.config
Files:
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml
🧠 Learnings (11)

📓 Common learnings (learnt from CR; Repo: HolmesGPT/holmesgpt; File: CLAUDE.md; Timestamp: 2026-01-05T11:14:20.208Z):

- Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
- Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
- Applies to tests/llm/**/*.yaml : Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness

📚 Learnings applied to tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml (all learnt from CR; File: CLAUDE.md; Timestamp: 2026-01-05T11:14:20.208Z):

- Applies to tests/llm/fixtures/**/*.yaml : Match user prompt to test - prompt must explicitly request what you're testing
- Applies to tests/llm/fixtures/**/*.yaml : Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
- Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
- Applies to tests/llm/fixtures/**/*.yaml : Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
- Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
- Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
- Applies to tests/llm/fixtures/**/*.yaml : Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
- Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
- Applies to tests/llm/fixtures/**/*.yaml : Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context
- Applies to tests/llm/fixtures/**/*.yaml : Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: llm_evals
- GitHub Check: build
- GitHub Check: build (3.12)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
🔇 Additional comments (4)
tests/llm/fixtures/test_ask_holmes/183_count_configmaps_per_namespace/test_case.yaml (4)
18-30: LGTM - Efficient parallel ConfigMap creation. The parallel creation with background processes and `wait` is efficient. The verification step (lines 32-46) ensures the correct counts are achieved.
32-48: LGTM - Proper verification with early exit on failure. The verification correctly checks exact ConfigMap counts and uses `exit 1` to fail the test early if setup doesn't match the expected state, as per the coding guidelines.
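A minimal sketch of the verification pattern this comment describes, assuming expected totals and variable names for illustration; the real script's details may differ.

```bash
# Sketch only: count ConfigMaps per namespace and fail setup early on a mismatch.
expected_alpha=49   # assumed from the PR description
expected_beta=62
alpha_count=$(kubectl get configmaps -n app-183-alpha --no-headers | wc -l)
beta_count=$(kubectl get configmaps -n app-183-beta --no-headers | wc -l)

if [ "$alpha_count" -ne "$expected_alpha" ] || [ "$beta_count" -ne "$expected_beta" ]; then
  echo "Setup verification failed: got $alpha_count and $beta_count ConfigMaps"
  exit 1   # fail the test early, per the coding guidelines
fi
```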
14-16: Namespace names are unique across all tests. ✓
1-11: Test number 183 is properly sequential, and all tags (kubernetes, counting, toolset-limitation) are valid per pyproject.toml.
/eval
@aantn Your eval run has finished. ✅ Completed successfully 🧪 Manual Eval Results
Results of HolmesGPT evals
Time/Cost columns show % change vs the historical average (↑ slower/costlier, ↓ faster/cheaper); changes under 10% are shown as ±0%. Historical comparison filter: excluding branch 'master'. Status: Success, 18 test/model combinations loaded. Experiments compared: 30.
- Renamed test from 183 to 184 (183 was already taken)
- Updated namespaces to app-184-alpha and app-184-beta
- Fixed expected counts to 50 and 63 (accounts for the default kube-root-ca.crt ConfigMap added by Kubernetes)

Signed-off-by: Claude <noreply@anthropic.com>
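To illustrate why the expected counts moved to 50 and 63: every namespace automatically receives a kube-root-ca.crt ConfigMap, so a raw per-namespace count includes one extra object. The commands below are a hypothetical way to confirm this and are not taken from the fixture.

```bash
# 49 created + 1 default kube-root-ca.crt = 50 (and 62 + 1 = 63 in the other namespace).
kubectl get configmaps -n app-184-alpha --no-headers | wc -l   # expected: 50
kubectl get configmaps -n app-184-beta --no-headers | wc -l    # expected: 63
```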
/eval
@aantn Your eval run has finished. ✅ Completed successfully 🧪 Manual Eval Results
Results of HolmesGPT evals
Time/Cost columns show % change vs the historical average (↑ slower/costlier, ↓ faster/cheaper); changes under 10% are shown as ±0%. Historical comparison filter: excluding branch 'master'. Status: Success, 9 test/model combinations loaded. Experiments compared: 30.
Prevents LLM from guessing count by looking at max sequential number. Names are now like cm-a1b2c3d4e5f6 instead of config-1, config-2, etc. Signed-off-by: Claude <noreply@anthropic.com>
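A sketch of the anti-cheat naming idea from this commit; the exact random-suffix generation in the fixture is an assumption here (openssl rand is just one way to produce a 12-character hex suffix like a1b2c3d4e5f6).

```bash
# Sketch only: random, non-sequential ConfigMap names so the count can't be
# inferred from the highest numeric suffix.
for i in $(seq 1 49); do
  kubectl create configmap "cm-$(openssl rand -hex 6)" -n app-184-alpha &
done
wait   # wait for the parallel background creations to finish
```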
/eval
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
@tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml:
- Around line 1-11: The test case is missing the required runbooks field; update
the YAML for this test (near the top-level keys like user_prompt and
expected_output) to include a runbooks entry (e.g., an empty catalog) so all
eval tests declare runbooks even when none are provided; ensure the new runbooks
field appears at the top level alongside user_prompt, expected_output, and tags.
- Line 1: The test directory and file are named
184_count_configmaps_per_namespace but that test number is already used; rename
the directory and all file references from 184_count_configmaps_per_namespace to
192_count_configmaps_per_namespace (including the YAML "user_prompt" file and
any imports/CI references) so the test number sequence is unique and consistent.
🧹 Nitpick comments (1)
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml (1)
4-6: Consider adding include_tool_calls to verify tool usage. The expected output checks for exact counts (50 and 63), which are specific. However, since counting ConfigMaps requires tool usage, consider adding include_tool_calls: true to explicitly verify the counting tool was invoked rather than relying solely on the numeric output. Based on learnings, use include_tool_calls when output values alone might not rule out hallucinations.
🔧 Optional enhancement

     expected_output:
       - The answer must state that app-184-alpha has exactly 50 ConfigMaps
       - The answer must state that app-184-beta has exactly 63 ConfigMaps
    +
    +include_tool_calls: true
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
🧰 Additional context used
📓 Path-based instructions (4)
tests/llm/**/*.{py,yaml}
📄 CodeRabbit inference engine (CLAUDE.md)
All pod names must be unique across tests (never reuse pod names between tests)
Files:
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/**/*.yaml: Never use resource names that hint at the problem or expected behavior in evals (avoid broken-pod, test-project-that-does-not-exist, crashloop-app)
Only use valid tags from pyproject.toml for LLM tests - invalid tags cause test collection failures
Use exit 1 when setup verification fails to fail the test early
Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
Use kubectl exec over port forwarding for setup verification to avoid port conflicts
Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)
Use retry loops for kubectl wait to handle race conditions, don't use bare kubectl wait immediately after resource creation
Use realistic logs in eval tests, not fake/obvious logs like 'Memory usage stabilized at 800MB'
Use realistic filenames in eval tests, not hints like 'disk_consumer.py' - use names like 'training_pipeline.py'
Use real-world scenarios in eval tests (ML pipelines with checkpoint issues, database connection pools) not simulated scenarios
Files:
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/test_ask_holmes/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
Use sequential test numbers for eval tests, checking existing tests for next available number
Files:
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
tests/llm/fixtures/**/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
tests/llm/fixtures/**/*.yaml: Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
Never use :latest container tags - use specific versions like grafana/grafana:12.3.1
Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
Match user prompt to test - prompt must explicitly request what you're testing
Don't use technical terms that give away solutions in user prompts - use anti-cheat prompts that prevent domain knowledge shortcuts
Test discovery and analysis ability, not recognition - Holmes should search/analyze, not guess from context
Use include_tool_calls: true to verify tool was called when output values are too generic to rule out hallucinations
Use neutral, application-specific names in eval resources instead of obvious technical terms to prevent domain knowledge cheats
Avoid hint-giving resource names - use realistic business context (checkout-api, user-service, inventory-db) not obvious problem indicators (broken-pod, payment-service-1)
Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml
Toolset config in eval tests must go under config field: toolsets.toolset_name.enabled.config
Files:
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
🧠 Learnings (11)

📓 Common learnings (learnt from CR; Repo: HolmesGPT/holmesgpt; File: CLAUDE.md; Timestamp: 2026-01-05T11:14:20.222Z):

- Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
- Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
- Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
- Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
- Applies to tests/llm/fixtures/**/*.yaml : Custom runbooks in eval tests: Add runbooks field in test_case.yaml (use runbooks: {} for empty catalog)
- Applies to tests/llm/fixtures/shared/**/*.yaml : Create shared infrastructure manifest in tests/llm/fixtures/shared/servicename.yaml when multiple tests use the same service
- Applies to tests/llm/fixtures/**/*.yaml : Custom toolsets for eval tests: Create separate toolsets.yaml file, never put toolset config in test_case.yaml

📚 Learnings applied to tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml (all learnt from CR; File: CLAUDE.md; Timestamp: 2026-01-05T11:14:20.222Z):

- Applies to tests/llm/fixtures/**/*.yaml : Be specific in expected_output - test exact values like title or unique injected values, not generic patterns
- Applies to tests/llm/fixtures/**/*.yaml : Use Secrets for scripts in eval test manifests, not inline manifests or ConfigMaps
- Applies to tests/llm/fixtures/test_ask_holmes/**/*.yaml : Use sequential test numbers for eval tests, checking existing tests for next available number
- Applies to tests/llm/fixtures/**/*.yaml : Match user prompt to test - prompt must explicitly request what you're testing
- Applies to tests/llm/fixtures/**/*.yaml : Add source comments in eval test manifests for anti-cheat: 'Uses Node Exporter dashboard but renamed to prevent cheats'
- Applies to tests/llm/fixtures/**/*.yaml : Implement full architecture in eval tests even if complex (use Loki for log aggregation, proper separation of concerns) with minimal resource footprints
- Applies to tests/llm/fixtures/**/*.yaml : Required files for eval tests: test_case.yaml, infrastructure manifests, and toolsets.yaml (if needed)
- Applies to tests/llm/**/*.yaml : Use sleep 1 instead of sleep 5 for retry loops, remove unnecessary sleeps, reduce timeout values (60s for pod readiness, 30s for API verification)
- Applies to tests/llm/fixtures/shared/**/*.yaml : Create shared infrastructure manifest in tests/llm/fixtures/shared/servicename.yaml when multiple tests use the same service
- Applies to tests/llm/**/*.yaml : Poll real API endpoints and check for expected content in setup verification, don't just test pod readiness
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build (3.10)
- GitHub Check: build (3.11)
- GitHub Check: build (3.12)
- GitHub Check: llm_evals
🔇 Additional comments (5)
tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml (5)
18-32: LGTM! Parallel ConfigMap creation with anti-cheat measures. The parallel creation of ConfigMaps using random UUID names effectively prevents counting by predictable sequential patterns, which aligns with the anti-cheat requirements. The explicit comments about the default kube-root-ca.crt ConfigMap add clarity to the expected counts.
34-50: LGTM! Proper setup verification with exit 1. The verification logic correctly checks the expected counts and fails the test early with `exit 1` if the setup doesn't match expectations, following the guideline for setup verification failures.
52-54: LGTM! Proper cleanup. The cleanup script correctly removes the test namespaces with idempotent deletion using `|| true`.
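A minimal sketch of the idempotent cleanup pattern described here; the namespace names are taken from this version of the test and the flags are illustrative.

```bash
# Sketch only: delete the test namespaces, tolerating the case where they are already gone.
kubectl delete namespace app-184-alpha --ignore-not-found=true || true
kubectl delete namespace app-184-beta --ignore-not-found=true || true
```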
8-11: All tags (kubernetes, counting, toolset-limitation) are valid according to pyproject.toml and will not cause test collection failures.
1-1: No action needed. Infrastructure.yaml is not required for eval tests in this codebase. Analysis of 186 test_case.yaml files in test_ask_holmes shows no infrastructure.yaml files exist anywhere in tests/llm, and test 184 correctly follows the established pattern. Likely an incorrect or invalid review comment.
    user_prompt:
      - "How many ConfigMaps are in namespace app-184-alpha and how many are in app-184-beta? Give me the exact counts."

    expected_output:
      - The answer must state that app-184-alpha has exactly 50 ConfigMaps
      - The answer must state that app-184-beta has exactly 63 ConfigMaps

    tags:
      - kubernetes
      - counting
      - toolset-limitation
Add missing runbooks field.

The test_case.yaml is missing the required runbooks field. Even if no custom runbooks are needed, include an empty catalog. Based on learnings, all eval tests should include a runbooks field in test_case.yaml.

➕ Proposed fix

     tags:
       - kubernetes
       - counting
       - toolset-limitation
    +
    +runbooks: {}

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    user_prompt:
      - "How many ConfigMaps are in namespace app-184-alpha and how many are in app-184-beta? Give me the exact counts."
    expected_output:
      - The answer must state that app-184-alpha has exactly 50 ConfigMaps
      - The answer must state that app-184-beta has exactly 63 ConfigMaps
    tags:
      - kubernetes
      - counting
      - toolset-limitation
    runbooks: {}
🤖 Prompt for AI Agents
In
@tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
around lines 1 - 11, The test case is missing the required runbooks field;
update the YAML for this test (near the top-level keys like user_prompt and
expected_output) to include a runbooks entry (e.g., an empty catalog) so all
eval tests declare runbooks even when none are provided; ensure the new runbooks
field appears at the top level alongside user_prompt, expected_output, and tags.
    @@ -0,0 +1,54 @@
    user_prompt:
🧩 Analysis chain
🏁 Script executed:
    #!/bin/bash
    # List all test directories in test_ask_holmes and show the last 20 to verify test number sequence
    echo "All test directories in test_ask_holmes (last 20):"
    fd -t d '^[0-9]+_' tests/llm/fixtures/test_ask_holmes/ | sed 's|.*test_ask_holmes/||' | sed 's|_.*||' | sort -n | tail -20
    echo ""
    echo "Highest test number found:"
    fd -t d '^[0-9]+_' tests/llm/fixtures/test_ask_holmes/ | sed 's|.*test_ask_holmes/||' | sed 's|_.*||' | sort -n | tail -1

Repository: HolmesGPT/holmesgpt
Length of output: 225
🏁 Script executed:
    #!/bin/bash
    # Check what test directories exist for numbers around the gap
    echo "Checking for test 183:"
    ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '183'
    echo ""
    echo "Checking for test 184:"
    ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '184'
    echo ""
    echo "Checking for test 192:"
    ls -la tests/llm/fixtures/test_ask_holmes/ | grep '^d' | grep '192'

Repository: HolmesGPT/holmesgpt
Length of output: 891
Rename test directory to 192_count_configmaps_per_namespace; test number 184 is already in use.
Test number 184 is already used by the existing test 184_elasticsearch_index_explosion. The next available sequential test number is 192 (filling the gap after 191). Rename the test directory and file references accordingly.
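One possible way to apply the suggested rename, assuming a plain git checkout; the paths and follow-up grep are illustrative.

```bash
# Hypothetical rename of the fixture directory from 184 to 192.
git mv tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace \
       tests/llm/fixtures/test_ask_holmes/192_count_configmaps_per_namespace
# Then check for remaining references to the old test number (namespaces, docs, CI filters).
grep -rn "184_count_configmaps_per_namespace" . || true
```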
🤖 Prompt for AI Agents
In
@tests/llm/fixtures/test_ask_holmes/184_count_configmaps_per_namespace/test_case.yaml
at line 1, The test directory and file are named
184_count_configmaps_per_namespace but that test number is already used; rename
the directory and all file references from 184_count_configmaps_per_namespace to
192_count_configmaps_per_namespace (including the YAML "user_prompt" file and
any imports/CI references) so the test number sequence is unique and consistent.
@aantn Your eval run has finished. ✅ Completed successfully 🧪 Manual Eval Results
Results of HolmesGPT evals
Historical comparison unavailable: no historical metrics found (no passing tests with duration data, excluding branch 'master'). Historical comparison filter: excluding branch 'master'. Experiments compared: 30.
Updated all namespace references from app-184-* to app-195-*. Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Claude <noreply@anthropic.com>
/eval
@aantn Your eval run has finished. 🧪 Manual Eval Results
Results of HolmesGPT evals
Time/Cost columns show % change vs the historical average (↑ slower/costlier, ↓ faster/cheaper); changes under 10% are shown as ±0%. Historical comparison filter: excluding branch 'master'. Status: Success, 9 test/model combinations loaded. Experiments compared: 30.
Comparison indicators:
| Icon | Meaning |
|---|---|
| ✅ | The test was successful |
| ➖ | The test was skipped |
| | The test failed but is known to be flaky or known to fail |
| 🚧 | The test had a setup failure (not a code regression) |
| 🔧 | The test failed due to mock data issues (not a code regression) |
| 🚫 | The test was throttled by API rate limits/overload |
| ❌ | The test failed and should be fixed before merging the PR |
🔄 Re-run evals manually
⚠️ Warning: /eval comments always run using the workflow from master, not from this PR branch. If you modified the GitHub Action (e.g., added secrets or env vars), those changes won't take effect. To test workflow changes, use the GitHub CLI or Actions UI instead:
gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/add-counting-tool-PA7Nz -f markers=regression -f filter=
Option 1: Comment on this PR with /eval:
/eval
markers: regression
Or with more options (one per line):
/eval
model: gpt-4o
markers: regression
filter: 09_crashpod
iterations: 5
Run evals on a different branch (e.g., master) for comparison:
/eval
branch: master
markers: regression
| Option | Description |
|---|---|
| model | Model(s) to test (default: same as automatic runs) |
| markers | Pytest markers (no default - runs all tests!) |
| filter | Pytest -k filter (use /list to see valid eval names) |
| iterations | Number of runs, max 10 |
| branch | Run evals on a different branch (for cross-branch comparison) |
Quick re-run: Use /rerun to re-run the most recent /eval on this PR with the same parameters.
Option 2: Trigger via GitHub Actions UI → "Run workflow"
🏷️ Valid markers
benchmark, chain-of-causation, compaction, context_window, coralogix, counting, database, datadog, datetime, easy, elasticsearch, embeds, frontend, grafana-dashboard, hard, kafka, kubernetes, leaked-information, logs, loki, medium, metrics, network, newrelic, no-cicd, numerical, one-test, port-forward, prometheus, question-answer, regression, runbooks, slackbot, storage, toolset-limitation, traces, transparency
Commands: /eval · /rerun · /list
CLI: gh workflow run eval-regression.yaml --repo HolmesGPT/holmesgpt --ref claude/add-counting-tool-PA7Nz -f markers=regression -f filter=
Adds eval 195
Summary by CodeRabbit