
Conversation

@khrm
Contributor

@khrm khrm commented Sep 29, 2025

Changes

  • Updated pipelinerunmetrics and taskrunmetrics to use OpenTelemetry instruments (histograms, counters, gauges) for creating and recording metrics (a minimal sketch follows this list).
  • Introduced new OpenTelemetry configuration in config/config-observability.yaml for exporters and protocols.
  • Rewrote the test suites for pipelinerunmetrics and taskrunmetrics to be compatible with the new OpenTelemetry-based implementation.
  • Updated knative to 1.19
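
A minimal sketch of the OpenTelemetry instrument pattern referenced in the first bullet (not the exact Tekton code; the meter name, metric names, and recorded values below are illustrative):

package main

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

func main() {
    // Instruments are created once from a named meter at startup.
    meter := otel.Meter("tekton-pipelines-controller")

    durHist, err := meter.Float64Histogram(
        "pipelinerun_duration_seconds",
        metric.WithDescription("Duration of PipelineRun execution"),
        metric.WithUnit("s"),
    )
    if err != nil {
        panic(err)
    }
    runTotal, err := meter.Int64Counter("pipelinerun_total")
    if err != nil {
        panic(err)
    }

    // ...and recorded against a context wherever a run completes.
    ctx := context.Background()
    durHist.Record(ctx, (3 * time.Minute).Seconds())
    runTotal.Add(ctx, 1)
}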

fixes #8969

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • pre-commit Passed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Breaking Changes in metrics

  1. Infrastructure Metric Renaming
  Infrastructure metrics (Go runtime, Workqueue, K8s Client) have been renamed from the tekton_pipelines_controller_ prefix to standard OpenTelemetry/Knative namespaces.


| Category   | Old Metric Name (OpenCensus)                                   | New Metric Name (OpenTelemetry)               | Changes                          |
|------------|----------------------------------------------------------------|-----------------------------------------------|----------------------------------|
| Workqueue  | tekton_pipelines_controller_workqueue_adds_total               | kn_workqueue_adds_total                       | Prefix change                    |
|            | tekton_pipelines_controller_workqueue_depth                    | kn_workqueue_depth                            | Prefix change                    |
|            | tekton_pipelines_controller_workqueue_queue_latency_seconds    | kn_workqueue_queue_duration_seconds           | Renamed latency -> duration      |
|            | tekton_pipelines_controller_workqueue_work_duration_seconds    | kn_workqueue_process_duration_seconds         | Renamed work -> process          |
|            | tekton_pipelines_controller_workqueue_retries_total            | kn_workqueue_retries_total                    | Prefix change                    |
|            | tekton_pipelines_controller_workqueue_unfinished_work_seconds  | kn_workqueue_unfinished_work_seconds          | Prefix change                    |
| K8s Client | tekton_pipelines_controller_client_latency                     | http_client_request_duration_seconds          | Standard HTTP metric             |
|            | tekton_pipelines_controller_client_results                     | kn_k8s_client_http_response_status_code_total | Detailed status tracking         |
| Go Runtime | tekton_pipelines_controller_go_*                               | go_* (e.g. go_goroutines)                     | Standard Prometheus Go collector |


  2. Label Changes in Tekton Metrics
  While the core metric names for PipelineRuns and TaskRuns have been preserved, some labels have changed to provide more detail or align with OTel standards.

   * `tekton_pipelines_controller_pipelinerun_duration_seconds`:
       * Added: reason label (e.g., Completed, Succeeded). This allows for more granular breakdown of durations but increases cardinality.
   * `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds`:
       * Added: reason label (a recording sketch follows this list).
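
A rough illustration of how the new reason label could be attached when recording (a sketch only, not the actual Tekton code; the attribute keys and values are examples):

package main

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

func main() {
    ctx := context.Background()
    durHist, _ := otel.Meter("example").Float64Histogram(
        "pipelinerun_duration_seconds", metric.WithUnit("s"))

    // The reason attribute rides alongside the existing labels, so each
    // distinct reason value produces a separate series (higher cardinality).
    durHist.Record(ctx, (2 * time.Minute).Seconds(),
        metric.WithAttributes(
            attribute.String("status", "success"),
            attribute.String("reason", "Succeeded"),
        ),
    )
}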

  3. Removed Metrics
   * `tekton_pipelines_controller_reconcile_count` and `tekton_pipelines_controller_reconcile_latency`: These custom reconcile metrics are no longer emitted. Users should rely on
     kn_workqueue_process_duration_seconds and kn_workqueue_adds_total to monitor controller performance.

  What Remained Compatible
  The following critical metrics have been explicitly preserved to minimize disruption:
   * TaskRun Pod Latency: tekton_pipelines_controller_taskruns_pod_latency_milliseconds remains a Gauge (preserving behavior despite OTel defaults preferring Histograms).
   * Total Counters: tekton_pipelines_controller_pipelinerun_total and tekton_pipelines_controller_taskrun_total retain their original labels (status) and do not include the high-cardinality
     reason label.
   * Bucket Boundaries: Duration histograms (e.g., taskrun_duration_seconds) retain their specific explicit buckets (10s, 30s, 1m, etc.) instead of defaulting to OTel's millisecond-focused buckets (see the sketch below).
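
A sketch of how second-based buckets can be kept on an OTel histogram via explicit bucket boundaries (illustrative; the actual bucket list and metric names in the code may differ):

package main

import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

func main() {
    meter := otel.Meter("tekton-pipelines-controller")

    // Explicit boundaries in seconds (10s, 30s, 1m, ...) instead of the
    // SDK's default boundaries.
    _, err := meter.Float64Histogram(
        "taskrun_duration_seconds",
        metric.WithUnit("s"),
        metric.WithExplicitBucketBoundaries(10, 30, 60, 300, 900, 1800, 3600),
    )
    if err != nil {
        panic(err)
    }
}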

 Upgrade Actions
   1. Update Config: Ensure your config-observability ConfigMap uses metrics-protocol: prometheus (or grpc/http) instead of the old metrics.backend-destination. If prometheus was already in use, no changes are needed.
   2. Update Dashboards:
       * Replace tekton_pipelines_controller_workqueue_* queries with kn_workqueue_*.
       * Replace tekton_pipelines_controller_go_* queries with standard go_* metrics.
       * Check aggregations on pipelinerun_duration_seconds to account for the new reason label (use sum by (label) if necessary to aggregate away the new dimension).

/kind feature

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. labels Sep 29, 2025
@tekton-robot tekton-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 29, 2025
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pipelinerunmetrics/metrics.go 86.7% 68.8% -17.9
pkg/reconciler/pipelinerun/pipelinerun.go 91.6% 91.6% -0.0
pkg/taskrunmetrics/metrics.go 87.3% 72.8% -14.5

@khrm khrm force-pushed the mig-otel branch 2 times, most recently from bab5a44 to a2ea129 on September 29, 2025 at 10:53
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pipelinerunmetrics/metrics.go 86.7% 68.8% -17.9
pkg/reconciler/pipelinerun/pipelinerun.go 91.6% 91.6% -0.0
pkg/taskrunmetrics/metrics.go 87.3% 72.8% -14.5

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pipelinerunmetrics/metrics.go 86.7% 68.8% -17.9
pkg/reconciler/pipelinerun/pipelinerun.go 91.6% 91.6% -0.0
pkg/taskrunmetrics/metrics.go 87.3% 72.8% -14.5

@waveywaves
Member

/retest

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pipelinerunmetrics/metrics.go 86.7% 68.8% -17.9
pkg/reconciler/pipelinerun/pipelinerun.go 91.6% 91.6% -0.0
pkg/taskrunmetrics/metrics.go 87.3% 72.8% -14.5

@waveywaves
Member

/retest

@vdemeester vdemeester added this to the v1.7.0 milestone Oct 1, 2025
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2025
@vdemeester vdemeester modified the milestones: v1.7.0, v1.8.0 Dec 3, 2025
@enarha

enarha commented Dec 15, 2025

Re the knative bump, we discussed multiple times that we want knative/pkg@04fdd0b included in the next bump, even though it only exists in main for now (not even in 1.20). It'll unlock better finalizer management, which proved to be an issue for Tekton deployments with multiple controllers managing the same PR/TR resources. So do we want to bump knative/pkg even higher?

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2025
@khrm
Contributor Author

khrm commented Dec 16, 2025

/assign @vdemeester @waveywaves

@khrm
Contributor Author

khrm commented Dec 16, 2025

@enarha I have updated to latest knative/pkg for now.

@khrm
Contributor Author

khrm commented Dec 16, 2025

Why are ci tests being skipped?

@khrm
Contributor Author

khrm commented Dec 16, 2025

/kind feature

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2026
@khrm
Contributor Author

khrm commented Jan 12, 2026

Could you check the metric related to throttled TaskRuns? The values are accumulating on every loop, increasing indefinitely instead of capturing the correct current value.

@anithapriyanatarajan Can you recheck this?

Contributor

@anithapriyanatarajan anithapriyanatarajan left a comment

Some tests in the test/ folder are still using OpenCensus. Could you assess that and remove OpenCensus from go.mod?

}

var statsReporterOptions []webhook.StatsReporterOption
enableNamespace := os.Getenv("WEBHOOK_METRICS_ENABLE_NAMESPACE")
Contributor

@khrm - Is this a breaking change for users?

Contributor

The following knative metrics are missing: reconcile_count and reconcile_latency. Could it be that knative has deprecated those, or are there other equivalents?

tekton_pipelines_controller_reconcile_count | counter | Number of reconcile operations | reconciler, success, namespace_name
tekton_pipelines_controller_reconcile_latency | histogram | Latency of reconcile operations

github.com/spiffe/go-spiffe/v2 v2.6.0
github.com/spiffe/spire-api-sdk v1.14.0
github.com/tektoncd/plumbing v0.0.0-20220817140952-3da8ce01aeeb
go.opencensus.io v0.24.0
Contributor

@khrm Why do we still require opencensus?

Contributor Author

It is being used for tracing.

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 13, 2026
@khrm
Contributor Author

khrm commented Jan 14, 2026

#9043 (comment)

The following knative metrics are missing: reconcile_count and reconcile_latency. Could it be that knative has deprecated those, or are there other equivalents?

tekton_pipelines_controller_reconcile_count | counter | Number of reconcile operations | reconciler, success, namespace_name
tekton_pipelines_controller_reconcile_latency | histogram | Latency of reconcile operations

Yes, there's an equivalent: kn.workqueue.work_duration_count should be there.

Member

@aThorp96 aThorp96 left a comment

Still reviewing but I've left some notes and questions

app.kubernetes.io/instance: default
app.kubernetes.io/part-of: tekton-pipelines
data:
metrics-protocol: prometheus
Member

Is this intentionally included in the configmap since it's already the default?

Contributor Author

Yes, it's intentionally introduced in the configmap. The default is none, which causes no metrics to be emitted, whereas previously prometheus metrics were emitted by default.

# OpenTelemetry Metrics Configuration
# Protocol for metrics export (prometheus, grpc, http/protobuf, none)
# Default: prometheus
metrics-protocol: prometheus
Member

These new settings all use metrics-* or tracing-* whereas settings previously were metrics.*. Is there any reason for the naming convention change? The unchanged settings are still metrics.* so if possible we should keep the naming consistent

Contributor Author

@khrm khrm Jan 15, 2026

Yes, these changes are due to OTEL migration.

Other metrics configurations are related to Tekton, so I am not sure whether there's any benefit to making breaking changes. Users don't need to make any changes if they were already using Prometheus.

Member

Yes, these changes are due to OTEL migration.

I assume that means OTEL consumes these directly so we can't control their names, which makes sense. Would you be opposed to linking to the OTEL docs in the comments of this section of settings? That should help with both documentation and delineating what settings we "own" and what are inherited.

Other metrics configurations are related to Tekton, so I am not sure whether there's any benefit to making breaking changes.

I agree, makes sense.

Users don't need to make any changes if they were already using Prometheus.

Makes sense. That does make me realize that we currently document stackdriver support. I've never used stackdriver and am not familiar with it, but for users currently using stackdriver what does their migration path look like?

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2026
khrm added 6 commits January 15, 2026 16:06
This upgrades knative to latest and other dependent dependencies.
Replaces the legacy knative.dev/pkg/metrics dependency with knative.dev/pkg/observability/configmap to align with the OpenTelemetry migration and ensure correct configuration loading.
This isn't supported after knative otel's migration.
The knativetest.Flags.EmitMetrics flag is no longer supported after
the migration to OpenTelemetry. This commit removes the conditional
initialisation to align with the new metrics system.
Run hack/update-codegen.sh to update generated client code, deepcopy functions, and CRD manifests. This aligns the generated code with recent dependency updates, including changes to finalizer management and context usage in informers.
Update the expected checksums in Pipeline and Task unit tests to match the new values generated after the knative update:
- pkg/apis/pipeline/v1
- pkg/apis/pipeline/v1beta1
@khrm khrm force-pushed the mig-otel branch 3 times, most recently from 492fef8 to bf58998 on January 15, 2026 at 11:52
@khrm
Contributor Author

khrm commented Jan 15, 2026

Some tests in the test/ folder are still using OpenCensus. Could you assess that and remove OpenCensus from go.mod?

This can be covered in a separate PR. It's related to tracing; we need to migrate to knative-based OTel tracing.
@waveywaves @vdemeester

metrics.allow-stackdriver-custom-metrics: "false"
# OpenTelemetry Metrics Configuration
# Protocol for metrics export (prometheus, grpc, http/protobuf, none)
# Default: prometheus
Member

Since we have to specify "prometheus" to make it "default" in our deployment, I think this comment needs to be clarified:

Suggested change
# Default: prometheus
# Default if not specified: "none"

Also: When you say (prometheus, grpc, http/protobuf, none), are the supported values ["prometheus", "grpc", "http/protobuf", "none"] or are the supported values ["prometheus", "grpc", "http/protobuf", ""] ? If "none" is "", then my above suggestion would become

Suggested change
# Default: prometheus
# Default if not specified: "" (none)

Contributor Author

It's "none".

This commit migrates the metrics for PipelineRuns and TaskRuns from
OpenCensus to OpenTelemetry.

  The following changes are included:
   - Updated the observability config to support OpenTelemetry.
   - Migrated the implementation of PipelineRun and TaskRun metrics to use
    the OpenTelemetry Go SDK.
   - Updated the tests to work with the new OpenTelemetry-based
    implementation.
Member

@vdemeester vdemeester left a comment

Thanks for this major migration effort! I've reviewed the metrics implementation and OpenCensus cleanup. The OpenTelemetry migration looks solid overall, but there are some remaining OpenCensus references that need to be addressed before merge.

Summary

Metrics Implementation ✅

The core metrics migration to OpenTelemetry is well-executed:

  • ✅ Proper use of OTel instruments (Histograms, Counters, ObservableGauges)
  • ✅ Backward-compatible metric names preserved
  • ✅ Good thread safety with mutex usage
  • ✅ Observable gauge callbacks implemented correctly

OpenCensus Cleanup ❌

Blocker: OpenCensus is still present in test files and go.mod

Critical Question ⚠️

@anithapriyanatarajan raised a concern on Jan 6 about throttled metrics accumulating instead of showing current values. Looking at the current code (lines 346-380 in pkg/taskrunmetrics/metrics.go), the implementation creates fresh maps on each observation, which should be correct. Has this issue been resolved? If so, it would be helpful to document what was fixed.

Required Changes

1. Remove OpenCensus from test files (Blocker)

The following test files still import and use OpenCensus for tracing:

test/custom_task_test.go line 37:

import "go.opencensus.io/trace"

Used on line 257 and throughout for test span tracking.

test/wait.go line 6:
Same import, used in lines 90, 117, 135, 153, 179, 209 for various wait function spans.

Action Required:
Migrate to OpenTelemetry tracing API:

// OLD (OpenCensus):
import "go.opencensus.io/trace"
_, span := trace.StartSpan(ctx, metricName)
defer span.End()

// NEW (OpenTelemetry):
import "go.opentelemetry.io/otel"
tracer := otel.Tracer("tekton-e2e-tests")
ctx, span := tracer.Start(ctx, metricName)
defer span.End()

2. Clean up go.mod

After migrating the test files, remove the OpenCensus dependency:

go.opencensus.io v0.24.0

Steps:

  1. Migrate test/custom_task_test.go and test/wait.go to OTel
  2. Run go mod tidy
  3. Run ./hack/update-deps.sh to update vendor/
  4. Verify go.opencensus.io is completely removed

Note: If it comes back as an indirect dependency, we'll need to track down which dependency is still pulling it in.

3. Throttled Metrics Question

In pkg/taskrunmetrics/metrics.go lines 346-380, the throttled metrics implementation looks correct to me:

  1. trsThrottledByQuota and trsThrottledByNode are local maps created fresh on each callback
  2. They are populated by iterating current TaskRuns
  3. They are reported once per observation (see the sketch below)
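
For reference, a minimal sketch of that observable-gauge pattern (not the actual Tekton implementation; the metric name, attribute key, and lookup helper are illustrative assumptions):

package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

// registerThrottleGauge registers a callback that rebuilds the per-namespace
// counts from the current TaskRuns on every observation, so the gauge reports
// the current value rather than accumulating.
func registerThrottleGauge(listThrottledNamespaces func() map[string]int64) error {
    meter := otel.Meter("tekton-pipelines-controller")

    gauge, err := meter.Int64ObservableGauge("running_taskruns_throttled_by_quota")
    if err != nil {
        return err
    }

    _, err = meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
        // Fresh counts on each callback; nothing is carried over between observations.
        for ns, count := range listThrottledNamespaces() {
            o.ObserveInt64(gauge, count, metric.WithAttributes(attribute.String("namespace", ns)))
        }
        return nil
    }, gauge)
    return err
}

func main() {
    // Hypothetical lookup used only to make the sketch runnable.
    _ = registerThrottleGauge(func() map[string]int64 {
        return map[string]int64{"default": 2}
    })
}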

This should give the current count, not accumulate. Can you confirm:

  • Was the accumulation bug fixed in a recent commit?
  • Is the current behavior correct now?

Overall Assessment

The metrics migration is well-implemented. Once the OpenCensus cleanup is complete and the throttled metrics question is clarified, this should be good to merge.

Member

@vdemeester vdemeester left a comment

Adding inline comments on the specific files modified in this PR (following up on my previous review).

"github.com/tektoncd/pipeline/pkg/apis/pipeline"
pipeline "github.com/tektoncd/pipeline/pkg/apis/pipeline"
v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1"
listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1"
Member

Great OTel Migration

The migration from OpenCensus to OpenTelemetry looks excellent here:

Before (OpenCensus):

"go.opencensus.io/stats"
"go.opencensus.io/stats/view"
"go.opencensus.io/tag"

After (OpenTelemetry):

"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"

This is the right approach.

flag.Parse()
flag.Set("alsologtostderr", "true")
logging.InitializeLogger()

Member

OpenCensus Metric Exporter Removed

Good - removed the old OpenCensus metric exporter initialization.

However, test/custom_task_test.go and test/wait.go still import OpenCensus for tracing (not metrics):

import "go.opencensus.io/trace"

These files are NOT modified in this PR but still use OpenCensus. They need to be migrated to OpenTelemetry tracing in a follow-up or this PR:

// Replace with:
import "go.opentelemetry.io/otel"

tracer := otel.Tracer("tekton-e2e-tests")
ctx, span := tracer.Start(ctx, metricName)
defer span.End()

github.com/pkg/errors v0.9.1
github.com/sigstore/sigstore v1.9.5
github.com/spiffe/go-spiffe/v2 v2.6.0
github.com/spiffe/spire-api-sdk v1.14.0
Member

OpenCensus Dependency Still Present

After the OpenTelemetry migration, this should be removed:

go.opencensus.io v0.24.0

Why it's still here:
The test files test/custom_task_test.go and test/wait.go still use OpenCensus for tracing (see my comment on test/util.go).

To remove:

  1. Migrate those test files to OTel tracing
  2. Run go mod tidy
  3. Run ./hack/update-deps.sh to update vendor/
  4. Verify go.opencensus.io is gone (or only present as indirect dependency)


Labels

kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Drop use of knative.dev/pkg/metrics

10 participants