TT-16826 OTel Metrics MCP Dimensions #7889

Merged
tbuchaillot merged 4 commits into master from opentelemetry/mcp-dimensions
Mar 18, 2026
@tbuchaillot
Contributor

@tbuchaillot tbuchaillot commented Mar 13, 2026

Description

Adds MCP-specific dimensions to the OTel custom metrics system, enabling users to observe MCP proxy traffic (tool calls, resource reads,
prompt invocations) through configurable OTel metric instruments.

Tyk's MCP proxy already introspects JSON-RPC payloads at the middleware layer. This PR surfaces that data (method, primitive type, primitive
name, error code) as metric dimensions aligned with the OTel mcp.* semantic conventions.

Context Propagation

  • 4 new context keys (MCPMethod, MCPPrimitiveType, MCPPrimitiveName, JSONRPCErrorCode) with ctxSet/ctxGet helpers following the
    existing gateway pattern
  • JSONRPCMiddleware.ProcessRequest() propagates MCP fields right after routing state is resolved — before access control, so rejected
    requests still carry full MCP context
  • writeJSONRPCError() updated to accept *http.Request and stash the JSON-RPC error code in context. Same for writeJSONRPCErrorResponse()
    in the error handler

Dimension Extractors

  • 4 MCP fields added to RequestContext and 4 metadata extractors registered:
    • mcp_method → JSON-RPC method name (tools/call, resources/read, initialize, etc.)
    • mcp_primitive_type → tool, resource, prompt, or ""
    • mcp_primitive_name → specific tool/resource/prompt name
    • mcp_error_code → JSON-RPC error code as string, empty when no error
  • NeedsMCP() flag on the registry (follows existing NeedsSession()/NeedsContext() pattern) — MCP fields only populated at the recording
    site when MCP dimensions are configured, zero overhead otherwise
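
One way to picture the extractor registration and the NeedsMCP() gate is the sketch below. The `RequestContext` fields, the `extractors` map, and the `Registry` type are simplified stand-ins invented for illustration; the real types in internal/otel/apimetrics may look different. Only the metadata key names and the NeedsMCP() method name are taken from the description.

```go
package main

import "fmt"

// RequestContext is a trimmed-down, hypothetical version of the struct
// that the PR extends with four MCP fields.
type RequestContext struct {
	APIID            string
	MCPMethod        string
	MCPPrimitiveType string
	MCPPrimitiveName string
	MCPErrorCode     string
}

// extractor pulls one dimension value out of the request context.
type extractor func(*RequestContext) string

// extractors maps metadata keys to their extractor functions.
var extractors = map[string]extractor{
	"api_id":             func(rc *RequestContext) string { return rc.APIID },
	"mcp_method":         func(rc *RequestContext) string { return rc.MCPMethod },
	"mcp_primitive_type": func(rc *RequestContext) string { return rc.MCPPrimitiveType },
	"mcp_primitive_name": func(rc *RequestContext) string { return rc.MCPPrimitiveName },
	"mcp_error_code":     func(rc *RequestContext) string { return rc.MCPErrorCode },
}

// mcpKeys lists the keys whose presence should switch the MCP gate on.
var mcpKeys = map[string]bool{
	"mcp_method":         true,
	"mcp_primitive_type": true,
	"mcp_primitive_name": true,
	"mcp_error_code":     true,
}

// Registry remembers which keys are configured and whether any is MCP-specific.
type Registry struct {
	keys     []string
	needsMCP bool
}

// NewRegistry validates the configured keys and computes the gate once.
func NewRegistry(keys ...string) (*Registry, error) {
	reg := &Registry{}
	for _, k := range keys {
		if _, ok := extractors[k]; !ok {
			return nil, fmt.Errorf("unknown metadata key %q", k)
		}
		reg.keys = append(reg.keys, k)
		if mcpKeys[k] {
			reg.needsMCP = true
		}
	}
	return reg, nil
}

// NeedsMCP reports whether the recording site has to populate MCP fields.
func (r *Registry) NeedsMCP() bool { return r.needsMCP }

// Values returns the dimension values in configuration order.
func (r *Registry) Values(rc *RequestContext) []string {
	out := make([]string, 0, len(r.keys))
	for _, k := range r.keys {
		out = append(out, extractors[k](rc))
	}
	return out
}

func main() {
	reg, err := NewRegistry("api_id", "mcp_method")
	if err != nil {
		panic(err)
	}
	rc := &RequestContext{APIID: "a1", MCPMethod: "tools/call"}
	fmt.Println(reg.NeedsMCP(), reg.Values(rc))
}
```

Computing the flag once at registration time keeps the per-request check to a single boolean read.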

New Dimensions

| Metadata Key | OTel Label | Values | Cardinality |
| --- | --- | --- | --- |
| mcp_method | mcp.method.name | tools/call, resources/read, prompts/get, initialize, ping, ... | Low (~15) |
| mcp_primitive_type | mcp.primitive.type | tool, resource, prompt, "" | Very Low (4) |
| mcp_primitive_name | mcp.primitive.name | Tool/resource/prompt names | Medium (bounded by access control + cardinality limits) |
| mcp_error_code | mcp.error.code | -32700, -32600, -32601, -32602, -32603, "" | Very Low (~5) |

Additionally, mcp.session.id is already available via existing header source ({"source": "header", "key": "Mcp-Session-Id", "label": "mcp.session.id"}) with no code changes.

Configuration Example

TYK_GW_OPENTELEMETRY_METRICS_APIMETRICS='[
  {
    "name": "tyk.mcp.requests.total",
    "type": "counter",
    "dimensions": [
      {"source": "metadata", "key": "mcp_method", "label": "mcp.method.name"},
      {"source": "metadata", "key": "mcp_primitive_type", "label": "mcp.primitive.type"},
      {"source": "metadata", "key": "mcp_primitive_name", "label": "mcp.primitive.name"},
      {"source": "metadata", "key": "mcp_error_code", "label": "mcp.error.code"},
      {"source": "metadata", "key": "response_code"},
      {"source": "metadata", "key": "api_id"}
    ]
  }
]'
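
To make the shape of that JSON concrete, here is a small sketch that decodes a definition like the one above. The `Metric` and `Dimension` structs and the `EffectiveLabel` helper are hypothetical names for this example. The fallback from a missing "label" to the "key" is an assumption, inferred from the E2E tests expecting api_id and response_code labels for dimensions configured without a label.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Dimension mirrors one entry of the "dimensions" array (hypothetical struct).
type Dimension struct {
	Source string `json:"source"`
	Key    string `json:"key"`
	Label  string `json:"label,omitempty"`
}

// EffectiveLabel returns the configured label, or the key when no label
// is given. The fallback is an assumption of this sketch.
func (d Dimension) EffectiveLabel() string {
	if d.Label != "" {
		return d.Label
	}
	return d.Key
}

// Metric mirrors one metric definition (hypothetical struct).
type Metric struct {
	Name       string      `json:"name"`
	Type       string      `json:"type"`
	Dimensions []Dimension `json:"dimensions"`
}

// parseMetrics decodes the JSON array carried by the environment variable.
func parseMetrics(raw string) ([]Metric, error) {
	var metrics []Metric
	if err := json.Unmarshal([]byte(raw), &metrics); err != nil {
		return nil, fmt.Errorf("parsing metric definitions: %w", err)
	}
	return metrics, nil
}

// sample is a shortened copy of the configuration example.
const sample = `[
  {
    "name": "tyk.mcp.requests.total",
    "type": "counter",
    "dimensions": [
      {"source": "metadata", "key": "mcp_method", "label": "mcp.method.name"},
      {"source": "metadata", "key": "api_id"}
    ]
  }
]`

func main() {
	metrics, err := parseMetrics(sample)
	if err != nil {
		panic(err)
	}
	for _, m := range metrics {
		for _, d := range m.Dimensions {
			fmt.Printf("%s: %s -> %s\n", m.Name, d.Key, d.EffectiveLabel())
		}
	}
}
```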

Design Decisions

  • ctxSet/ctxGet over direct struct access: MCP data propagated through request context rather than read directly from JSONRPCRoutingState at
    the recording site — makes data available to all downstream consumers (audit logs, traces, custom plugins), not just metrics
  • Set before access control: Rejected requests still carry full MCP context for metrics and error correlation
  • NeedsMCP flag: Zero overhead for non-MCP metric configurations
  • VEM loop safety: MCP context set once on first pass — JSONRPCMiddleware short-circuits on subsequent VEM hops

Related Issue

Motivation and Context

How This Has Been Tested

Unit tests

E2E Tests (ci/tests/metrics/metrics_test.go, profile: mcp)

Docker-compose stack: Go test client → Tyk Gateway → tzolov/mcp-everything-server:v3 → OTel Collector → Prometheus

8 test cases:

  1. MCP metric emission: tools/call produces tyk_mcp_requests_total{mcp_method_name="tools/call",mcp_primitive_type="tool"}
  2. Primitive name: an echo tool call produces tyk_mcp_requests_total{mcp_primitive_name="echo"}
  3. Non-MCP isolation: REST traffic to a non-MCP API does not appear in the MCP counter
  4. Histogram duration: tyk_mcp_primitive_duration_seconds_count{mcp_primitive_type="tool",mcp_primitive_name="echo"} exists after tool
    calls
  5. HTTP duration with MCP label: http_server_request_duration_seconds_count{mcp_method_name="tools/call"} exists
  6. Session ID from header: the header "Mcp-Session-Id: test-session-123" surfaces as the label mcp_session_id="test-session-123"
  7. Initialize method: a non-primitive method produces tyk_mcp_requests_total{mcp_method_name="initialize"} with no primitive type
  8. All expected labels: a full MCP session verifies that api_id, mcp_method_name, mcp_primitive_type,
    mcp_primitive_name, and response_code are all present
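
The test expectations use underscore names (mcp_method_name) while the configuration uses dotted OTel names (mcp.method.name), because Prometheus metric and label names do not allow dots. The sketch below shows a simplified version of that renaming. The `promName` and `selector` helpers are hypothetical and handle only the character replacement; real exporters apply further rules (for example for names starting with a digit) that are left out here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// promName replaces every character outside [a-zA-Z0-9_] with an underscore.
// Simplified: leading-digit handling and other exporter rules are omitted.
func promName(name string) string {
	var b strings.Builder
	for _, r := range name {
		isAlnum := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')
		if isAlnum || r == '_' {
			b.WriteRune(r)
		} else {
			b.WriteByte('_')
		}
	}
	return b.String()
}

// selector renders a series selector from an OTel metric name and labels,
// with label pairs sorted so the output is deterministic.
func selector(metric string, labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, fmt.Sprintf("%s=%q", promName(k), v))
	}
	sort.Strings(pairs)
	return promName(metric) + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	fmt.Println(selector("tyk.mcp.requests.total", map[string]string{
		"mcp.method.name":    "tools/call",
		"mcp.primitive.type": "tool",
	}))
	// prints tyk_mcp_requests_total{mcp_method_name="tools/call",mcp_primitive_type="tool"}
}
```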

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Refactoring or add test (improvements in base code or adds test coverage to functionality)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

@tbuchaillot tbuchaillot requested a review from a team as a code owner March 13, 2026 16:28
@github-actions
Contributor

github-actions bot commented Mar 13, 2026

API Changes

--- prev.txt	2026-03-18 08:09:47.743226233 +0000
+++ current.txt	2026-03-18 08:09:39.873261384 +0000
@@ -8593,6 +8593,14 @@
 	JSONRPCRoutingState
 	// MCPRouting indicates the request came via MCP JSON-RPC routing
 	MCPRouting
+	// MCPMethod stores the JSON-RPC method name for MCP metrics dimensions.
+	MCPMethod
+	// MCPPrimitiveType stores the MCP primitive type (tool/resource/prompt) for metrics dimensions.
+	MCPPrimitiveType
+	// MCPPrimitiveName stores the MCP primitive name for metrics dimensions.
+	MCPPrimitiveName
+	// JSONRPCErrorCode stores the JSON-RPC error code for metrics dimensions.
+	JSONRPCErrorCode
 )
 # Package: ./dlpython
 

@probelabs
Contributor

probelabs bot commented Mar 13, 2026

This PR introduces OpenTelemetry (OTel) metric dimensions for Model Context Protocol (MCP) traffic, enabling detailed observability into MCP-specific operations like tool calls, resource reads, and prompt invocations.

Key changes include:

  • Context Propagation: The JSONRPCMiddleware now parses MCP requests and propagates details (MCPMethod, MCPPrimitiveType, MCPPrimitiveName, JSONRPCErrorCode) through the http.Request context. This data is made available before access control checks, ensuring even rejected requests are instrumented.
  • Dimension Extractors: New metadata extractors (mcp_method, mcp_primitive_type, mcp_primitive_name, mcp_error_code) are added to the OTel metrics system to surface the context data as metric dimensions.
  • Performance Optimization: A NeedsMCP() flag is introduced in the metric registry. This ensures that MCP-related context is only populated and read if a user has explicitly configured a metric that requires it, avoiding any performance overhead for non-MCP APIs or configurations without MCP dimensions.
  • Error Instrumentation: Error handling in the JSON-RPC middleware and the global error handler has been updated to stash the JSON-RPC error code in the request context, allowing it to be used as a metric dimension.
  • End-to-End Testing: A new mcp profile has been added to the CI metrics test suite, complete with a mock MCP server, new API definitions, and a suite of tests to validate the emission of the new metrics.

Files Changed Analysis

The PR introduces 1075 lines of code across 19 files, with the bulk of the changes focused on adding the new feature and its corresponding tests.

  • ci/tests/metrics/: A significant portion of the changes are here, establishing a new end-to-end test environment for the feature. This includes a new test profile (mcp), Docker Compose configuration for a mock mcp-server, test API definitions, and extensive new test cases in metrics_test.go.
  • ctx/: New context keys are defined in ctx.go to standardize how MCP data is passed within the request lifecycle.
  • gateway/: This contains the core implementation logic. mw_jsonrpc.go is updated to parse requests and enrich the context. middleware.go is modified to read this enriched context when recording metrics, guarded by the NeedsMCP() check. A comprehensive new unit test file, mcp_dimensions_test.go, validates this new logic.
  • internal/otel/apimetrics/: This is where the integration with the OpenTelemetry metrics system occurs. The RequestContext is extended, new dimension extractors are added, and the InstrumentRegistry is updated to detect when MCP dimensions are in use to enable the NeedsMCP() performance gate.

Architecture & Impact Assessment

What this PR accomplishes

This PR significantly enhances the observability of Tyk's MCP proxy. It allows operators to gain deep insights into MCP traffic by creating detailed, dimensional metrics based on the specific MCP methods and primitives being invoked. This enables fine-grained monitoring, alerting, and dashboarding that was not previously possible.

Key technical changes introduced

  1. Context-Driven Data Flow: The use of http.Request context to pass data between middleware layers is a clean, decoupled approach that follows Go best practices.
  2. Configuration-Driven Activation: The feature is implicitly enabled only when a user configures a custom metric with MCP dimensions. This NeedsMCP() check is a crucial performance optimization that ensures zero overhead for users not utilizing this feature.
  3. Extensible Metrics Framework: The changes demonstrate the flexibility of the existing custom metrics system by adding new metadata sources without requiring fundamental architectural changes.

Affected system components

  • Tyk Gateway: The primary impact is on the JSON-RPC middleware and the OpenTelemetry metrics recording pipeline.
  • Observability Stack: Downstream monitoring systems (e.g., Prometheus, Grafana) will be able to ingest and visualize these new, more granular metrics.
  • Gateway Configuration: Users must update their gateway configuration to define custom metrics with the new dimensions to leverage this feature.

Component Interaction Flow

sequenceDiagram
    participant Client
    participant TykGateway as "Tyk Gateway"
    participant JSONRPCMiddleware as "JSON-RPC Middleware"
    participant BaseMiddleware as "Base Middleware"
    participant OTelInstruments as "OTel Instruments"

    Client->>+TykGateway: POST /mcp (JSON-RPC Request)
    TykGateway->>+JSONRPCMiddleware: ProcessRequest(r)
    JSONRPCMiddleware->>JSONRPCMiddleware: Parse request, extract MCP data
    JSONRPCMiddleware->>TykGateway: ctxSetMCPMethod(r, "tools/call")<br/>ctxSetMCPPrimitiveType(r, "tool")
    Note right of JSONRPCMiddleware: Request context is now enriched
    JSONRPCMiddleware-->>TykGateway: Continue middleware chain
    TykGateway->>+BaseMiddleware: RecordMetrics(r)
    alt if MetricInstruments.NeedsMCP() is true
        BaseMiddleware->>TykGateway: Populate RequestContext with MCP data from context
        BaseMiddleware->>+OTelInstruments: RecordAPIMetrics(RequestContext)
        OTelInstruments->>OTelInstruments: Extract MCP dimensions from RequestContext
        OTelInstruments-->>BaseMiddleware:
    end
    BaseMiddleware-->>TykGateway:
    TykGateway-->>-Client: HTTP Response

Scope Discovery & Context Expansion

The changes are well-contained within the gateway's request processing and metrics subsystems. The architectural pattern—enriching request context in a protocol-specific middleware for consumption by a generic downstream middleware (like metrics or logging)—is a robust and reusable pattern. This could serve as a blueprint for adding detailed observability for other protocols like GraphQL or gRPC in the future.

The implementation correctly builds upon the existing custom metrics framework (TYK_GW_OPENTELEMETRY_METRICS_APIMETRICS), reinforcing its flexibility and power.

Metadata
  • Review Effort: 3 / 5
  • Primary Label: feature

Powered by Visor from Probelabs

Last updated: 2026-03-18T08:10:34.870Z | Triggered by: pr_updated | Commit: a777121

💡 TIP: You can chat with Visor using /visor ask <your question>

@probelabs
Contributor

probelabs bot commented Mar 13, 2026

Security Issues (1)

| Severity | Location | Issue |
| --- | --- | --- |
| 🟡 Warning | gateway/mw_jsonrpc.go:190-195 | A potential high-cardinality issue exists for the `mcp_primitive_name` metric dimension. The primitive name is extracted from the user-provided JSON-RPC request body and stored in the request context *before* any access control checks are performed. If this dimension is enabled in the OpenTelemetry configuration, an unauthenticated attacker can send requests with arbitrary primitive names, leading to a large number of unique time series in the metrics backend (e.g., Prometheus). This can cause excessive memory consumption and a denial-of-service condition for the monitoring system. 💡 Suggestion: To mitigate the risk of cardinality explosion, apply access control *before* propagating the primitive name to the request context for metrics. The `mcp.Route` function already determines the primitive name. After this step, and before setting the context value, verify that the primitive is allowed for the given API key or policy. If the primitive is not allowed, either do not set the `mcp_primitive_name` in the context or set it to a low-cardinality placeholder value like "<denied>". This ensures that only valid, permitted primitive names are recorded as metric dimensions. |
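
The placeholder idea in this suggestion could look roughly like the sketch below. It is not what the PR implements: the `allowed` set, `metricPrimitiveName`, and `distinctLabels` are hypothetical helpers, and how the gateway actually determines whether a primitive is permitted is not shown in this conversation.

```go
package main

import "fmt"

// deniedPlaceholder is the low-cardinality value suggested in the review.
const deniedPlaceholder = "<denied>"

// metricPrimitiveName returns the name to record as a metric dimension.
// Empty names (non-primitive methods such as initialize) pass through;
// names outside the allowed set collapse into a single placeholder.
func metricPrimitiveName(name string, allowed map[string]bool) string {
	if name == "" || allowed[name] {
		return name
	}
	return deniedPlaceholder
}

// distinctLabels counts how many different label values a batch of
// requested names would produce after sanitising.
func distinctLabels(names []string, allowed map[string]bool) int {
	seen := make(map[string]struct{})
	for _, n := range names {
		seen[metricPrimitiveName(n, allowed)] = struct{}{}
	}
	return len(seen)
}

func main() {
	allowed := map[string]bool{"echo": true}
	names := []string{"echo"}
	for i := 0; i < 1000; i++ {
		names = append(names, fmt.Sprintf("bogus-%d", i))
	}
	// 1001 requested names collapse to two label values: "echo" and "<denied>".
	fmt.Println(distinctLabels(names, allowed))
}
```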

✅ Architecture Check Passed

No architecture issues found – changes LGTM.

Performance Issues (1)

| Severity | Location | Issue |
| --- | --- | --- |
| 🟡 Warning | gateway/mw_jsonrpc.go:193-198 | Context values for MCP dimensions are set on every MCP request, regardless of whether any metrics are configured to use them. Each `ctxSet...` call involves creating a new `http.Request` and `context.Context` object, which adds allocation overhead and GC pressure on the hot path. 💡 Suggestion: To avoid this overhead for users not using MCP metrics, the context-setting logic should be made conditional. Wrap the block in an `if m.Gw.MetricInstruments.NeedsMCP()` check. This aligns the cost of collecting this data with its actual use, following the pattern used for reading the data in `gateway/middleware.go`. |

Quality Issues (1)

| Severity | Location | Issue |
| --- | --- | --- |
| 🟡 Warning | ci/tests/metrics/metrics_test.go:816 | The end-to-end test suite for MCP metrics lacks coverage for requests that result in a JSON-RPC error. While unit tests cover the logic for stashing and extracting error codes, an end-to-end test is needed to verify that the `mcp_error_code` dimension is correctly populated and exported through the entire observability pipeline to Prometheus. 💡 Suggestion: Add a new test case to the `mcp` profile that intentionally triggers a JSON-RPC error (e.g., by calling a non-existent tool or providing invalid parameters) and asserts that the resulting metric in Prometheus contains the correct `mcp_error_code` label. This will ensure the error handling aspect of the feature is fully validated. |
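
A test along these lines would need to know which `mcp_error_code` label value to expect for a given response. The helper below is a sketch of how a test client could derive it from a JSON-RPC 2.0 response body. `errorCodeLabel` and `rpcResponse` are hypothetical names, and which code a particular MCP server returns for a non-existent tool is not stated in this conversation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// rpcResponse captures only the part of a JSON-RPC 2.0 response that the
// helper needs: the optional error object and its numeric code.
type rpcResponse struct {
	Error *struct {
		Code int `json:"code"`
	} `json:"error"`
}

// errorCodeLabel returns the error code as a string, or "" when the
// response carries no error, matching the mcp_error_code convention
// described in the PR (code as string, empty when no error).
func errorCodeLabel(body []byte) (string, error) {
	var resp rpcResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decoding JSON-RPC response: %w", err)
	}
	if resp.Error == nil {
		return "", nil
	}
	return strconv.Itoa(resp.Error.Code), nil
}

func main() {
	failed := []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32601,"message":"Method not found"}}`)
	ok := []byte(`{"jsonrpc":"2.0","id":2,"result":{}}`)

	for _, body := range [][]byte{failed, ok} {
		label, err := errorCodeLabel(body)
		if err != nil {
			panic(err)
		}
		fmt.Printf("mcp_error_code=%q\n", label)
	}
}
```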

Powered by Visor from Probelabs

Last updated: 2026-03-18T08:10:30.382Z | Triggered by: pr_updated | Commit: a777121


@tbuchaillot tbuchaillot changed the title OTel Metrics MCP Dimensions TT-16826 OTel Metrics MCP Dimensions Mar 13, 2026
@github-actions
Contributor

🚨 Jira Linter Failed

Commit: a777121
Failed at: 2026-03-18 08:09:05 UTC

The Jira linter failed to validate your PR. Please check the error details below:

🔍 Click to view error details
failed to validate branch and PR title rules: branch name 'opentelemetry/mcp-dimensions' must contain a valid Jira ticket ID (e.g., ABC-123)

Next Steps

  • Ensure your branch name contains a valid Jira ticket ID (e.g., ABC-123)
  • Verify your PR title matches the branch's Jira ticket ID
  • Check that the Jira ticket exists and is accessible

This comment will be automatically deleted once the linter passes.

@sonarqubecloud

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
97.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

@tbuchaillot tbuchaillot enabled auto-merge (squash) March 18, 2026 08:56
@tbuchaillot tbuchaillot merged commit 8aa4199 into master Mar 18, 2026
51 of 52 checks passed
@tbuchaillot tbuchaillot deleted the opentelemetry/mcp-dimensions branch March 18, 2026 09:05