@songy23 songy23 commented Jan 13, 2026

What does this PR do?

Add OTel Agent Gateway feature implementation. Reference implementation in helm charts: http://github.com/DataDog/helm-charts/blob/main/charts/datadog/templates/otel-agent-gateway-deployment.yaml

Motivation

Support OTel Agent Gateway in operator. See how it currently works in helm charts: https://docs.datadoghq.com/opentelemetry/setup/ddot_collector/install/kubernetes_gateway/

Additional Notes

HPA is not currently supported natively in the operator, so users may need to define an HPA as a separate resource.

Replicas, node selector, affinity and other configs will be supported in the next PR.

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: v7.74.0

Describe your test plan

Built the operator locally and deployed to a kind cluster with OTel Agent Gateway enabled.

For QA, use the following config:

apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog-otel-test
  namespace: datadog
spec:
  global:
    clusterName: kind-cluster
    credentials: ...
    site: datadoghq.com
  features:
    # Enable OTel Agent Gateway feature (standalone gateway deployment)
    otelAgentGateway:
      enabled: true
      # Use default ports: 4317 (gRPC) and 4318 (HTTP)

    # Enable OTel Collector feature (runs in each node agent pod)
    otelCollector:
      enabled: true
      # Collector will send data to the gateway service

Verify:

  1. k8s deployment datadog-otel-test-otel-agent-gateway is created
  2. k8s service datadog-otel-test-otel-agent-gateway is created and listening on 4317/TCP, 4318/TCP
  3. pod datadog-otel-test-otel-agent-gateway-<pod> is created and running
  4. k8s endpoint datadog-otel-test-otel-agent-gateway binds to the pod above
  5. the otel-agent container in the node agent pod has a config that sends to the k8s service above
  6. cluster role datadog-otel-test-otel-agent-gateway and its binding are created
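
The names in the checklist above follow a visible pattern, so the expected values can be derived from the DatadogAgent name. The Go sketch below does that for the QA example. The `-otel-agent-gateway` and `-agent` suffixes are inferred from the QA text in this PR only, not read from the operator source. `expectedFor` and `serviceAddr` are hypothetical helpers, and the `svc.cluster.local` suffix assumes the default cluster domain.

```go
package main

import "fmt"

// Default OTLP ports named in the QA config above.
const (
	otlpGRPCPort = 4317
	otlpHTTPPort = 4318
)

// expectedResources holds the names QA should look for.
// Hypothetical type, for illustration only.
type expectedResources struct {
	GatewayName      string // deployment, service, endpoints, and cluster role
	NodeAgentService string // service in front of the node agent's otel-agent
}

// expectedFor derives resource names from the DatadogAgent name.
// Assumption: the suffixes match the ones visible in the QA steps.
func expectedFor(ddaName string) expectedResources {
	return expectedResources{
		GatewayName:      ddaName + "-otel-agent-gateway",
		NodeAgentService: ddaName + "-agent",
	}
}

// serviceAddr builds an in-cluster DNS address for a Service port.
// Assumption: the cluster uses the default "cluster.local" domain.
func serviceAddr(service, namespace string, port int) string {
	return fmt.Sprintf("%s.%s.svc.cluster.local:%d", service, namespace, port)
}

func main() {
	r := expectedFor("datadog-otel-test")
	fmt.Println("gateway resources:", r.GatewayName)
	fmt.Println("client target:    ", serviceAddr(r.NodeAgentService, "datadog", otlpGRPCPort))
	fmt.Println("gateway gRPC:     ", serviceAddr(r.GatewayName, "datadog", otlpGRPCPort))
	fmt.Println("gateway HTTP:     ", serviceAddr(r.GatewayName, "datadog", otlpHTTPPort))
}
```

The client target printed here matches the `--otlp-endpoint` value used by the test client in this PR.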

Additionally, deploy the OTel test client below. Ideally, use a multi-node cluster and schedule the client on a different node than the one the gateway runs on, so that cross-node routing is tested.

apiVersion: v1
kind: Pod
metadata:
  name: otel-test-client
  namespace: datadog
spec:
  nodeSelector:
    kubernetes.io/hostname: kind-worker # replace if needed to make sure test client runs on a different node
  containers:
  - name: telemetrygen
    image: ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest
    command:
    - /telemetrygen
    args:
    - traces
    - --otlp-endpoint=datadog-otel-test-agent.datadog.svc.cluster.local:4317
    - --otlp-insecure
    - --duration=60s
    - --rate=1
    - --service=test-service
  restartPolicy: Never

The client sends OTLP data to the otel-agent in the node agent pod (NOT the OTel Agent Gateway) at datadog-otel-test-agent:4317. The otel-agent in the node agent pod then forwards the data to the OTel Agent Gateway, and the gateway ultimately sends it to Datadog.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

@songy23 songy23 added this to the v1.23.0 milestone Jan 13, 2026
@songy23 songy23 added the enhancement New feature or request label Jan 13, 2026
codecov-commenter commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 36.74912% with 179 lines in your changes missing coverage. Please review.
✅ Project coverage is 38.16%. Comparing base (8518b87) to head (936a0a3).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...controller/datadogagent/global/otelagentgateway.go | 0.00% | 48 Missing ⚠️ |
| ...ature/otelcollector/defaultconfig/defaultconfig.go | 0.00% | 40 Missing ⚠️ |
| pkg/testutils/builder.go | 0.00% | 33 Missing ⚠️ |
| ...datadogagent/component/otelagentgateway/default.go | 0.00% | 23 Missing ⚠️ |
| ...er/datadogagent/component/otelagentgateway/rbac.go | 0.00% | 23 Missing ⚠️ |
| ...r/datadogagent/feature/otelagentgateway/feature.go | 90.19% | 5 Missing and 5 partials ⚠️ |
| ...ogagentinternal/controller_reconcile_v2_helpers.go | 0.00% | 2 Missing ⚠️ |

@@            Coverage Diff             @@
##             main    #2483      +/-   ##
==========================================
+ Coverage   38.09%   38.16%   +0.06%     
==========================================
  Files         299      300       +1     
  Lines       25182    25461     +279     
==========================================
+ Hits         9594     9718     +124     
- Misses      14853    15002     +149     
- Partials      735      741       +6     
| Flag | Coverage Δ |
|---|---|
| unittests | 38.16% <36.74%> (+0.06%) ⬆️ |
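
The project percentages in the diff above can be reproduced from the hit, miss, and partial counts, taking coverage as hits divided by total lines. The Go sketch below does the arithmetic. One observation from these particular numbers: the reported 38.09% and 38.16% match truncation to two decimals, not rounding (the base value is roughly 38.0986%). That is an inference from this report, not a statement about Codecov's documented behavior, and `truncatedPercent` is a hypothetical helper.

```go
package main

import "fmt"

// truncatedPercent returns hits/lines as a percentage with two decimals,
// truncated rather than rounded. Integer math avoids floating-point drift.
func truncatedPercent(hits, lines int) string {
	bp := hits * 10000 / lines // hundredths of a percent, truncated
	return fmt.Sprintf("%d.%02d", bp/100, bp%100)
}

func main() {
	// Hits, misses, and partials as listed in the coverage diff.
	base := [3]int{9594, 14853, 735}
	head := [3]int{9718, 15002, 741}

	for _, r := range [][3]int{base, head} {
		lines := r[0] + r[1] + r[2] // total lines = hits + misses + partials
		fmt.Printf("lines=%d coverage=%s%%\n", lines, truncatedPercent(r[0], lines))
	}
}
```

The computed line totals (25182 and 25461) agree with the Lines row of the diff.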

| Files with missing lines | Coverage Δ |
|---|---|
| ...ller/datadogagent/feature/otelcollector/feature.go | 76.71% <100.00%> (+1.23%) ⬆️ |
| pkg/images/images.go | 95.27% <ø> (ø) |
| ...ogagentinternal/controller_reconcile_v2_helpers.go | 29.50% <0.00%> (-1.01%) ⬇️ |
| ...r/datadogagent/feature/otelagentgateway/feature.go | 90.40% <90.19%> (+90.40%) ⬆️ |
| ...datadogagent/component/otelagentgateway/default.go | 0.00% <0.00%> (ø) |
| ...er/datadogagent/component/otelagentgateway/rbac.go | 0.00% <0.00%> (ø) |
| pkg/testutils/builder.go | 0.00% <0.00%> (ø) |
| ...ature/otelcollector/defaultconfig/defaultconfig.go | 0.00% <0.00%> (ø) |
| ...controller/datadogagent/global/otelagentgateway.go | 0.00% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@songy23 songy23 force-pushed the yang.song/OTAGENT-512 branch from b0aeb8a to 262ecb3 Compare January 13, 2026 02:13
@songy23 songy23 force-pushed the yang.song/OTAGENT-512 branch from 262ecb3 to e33ef74 Compare January 13, 2026 13:36
@songy23 songy23 marked this pull request as ready for review January 13, 2026 19:54
@songy23 songy23 requested a review from a team as a code owner January 13, 2026 19:54
Comment on lines 179 to 188
internalTrafficPolicy := corev1.ServiceInternalTrafficPolicyLocal
if err := managers.ServiceManager().AddService(
	f.localServiceName,
	f.owner.GetNamespace(),
	common.GetOtelAgentGatewayServiceSelector(f.owner),
	[]corev1.ServicePort{*otlpGrpcPort, *otlpHttpPort},
	&internalTrafficPolicy,
); err != nil {
	return err
}
Member

Pretty sure you do NOT want to use an internal traffic policy here. That's intended for DaemonSet components; with it, pods would not see any endpoints unless they're on the same node as the gateway deployment. You wouldn't notice on a 1-node kind cluster since everything is on the same node: https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/#how-it-works
So it should remain nil, and you'd need to update the QA instructions to test this scenario properly (2-node cluster, client deployed on a non-gateway node, etc.)
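
The effect described in this comment can be modeled in a few lines. The Go sketch below is a simplified stand-in for Service routing, not operator or Kubernetes code: the `endpoint` type, the `reachable` function, and the node and pod names are all hypothetical. It only encodes the documented rule that `internalTrafficPolicy: Local` restricts in-cluster traffic to endpoints on the client's own node, while the default (`Cluster`, used when the field is unset) allows any endpoint.

```go
package main

import "fmt"

// endpoint is a simplified, hypothetical stand-in for a Service endpoint.
type endpoint struct {
	Pod  string
	Node string
}

// policy mirrors the two values of a Service's internalTrafficPolicy field.
type policy string

const (
	policyCluster policy = "Cluster" // Kubernetes default when the field is nil
	policyLocal   policy = "Local"   // only node-local endpoints are used
)

// reachable returns the endpoints a client on clientNode can be routed to.
func reachable(eps []endpoint, p policy, clientNode string) []endpoint {
	if p != policyLocal {
		return eps // Cluster policy: every ready endpoint is a candidate
	}
	var out []endpoint
	for _, ep := range eps {
		if ep.Node == clientNode {
			out = append(out, ep)
		}
	}
	return out // empty means traffic from this node is dropped
}

func main() {
	// One gateway pod on a different node than the client (hypothetical names).
	gateway := []endpoint{{Pod: "gateway-0", Node: "kind-worker2"}}

	for _, p := range []policy{policyLocal, policyCluster} {
		n := len(reachable(gateway, p, "kind-worker"))
		fmt.Printf("policy=%s client=kind-worker reachable=%d\n", p, n)
	}
}
```

With a Deployment-based gateway, most nodes have no local gateway pod, so the Local policy leaves their clients with zero endpoints. A single-node cluster hides this because the client and gateway always share a node.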

Member Author

Good call. Yes, the gateway should use the cluster traffic policy (the default). I've fixed this and verified that it works in a 3-node kind cluster.

Member Author

Updated QA too

@tbavelier tbavelier merged commit cc18a4e into main Jan 15, 2026
32 checks passed
@tbavelier tbavelier deleted the yang.song/OTAGENT-512 branch January 15, 2026 08:17