-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat(metrics): Migrate from OpenCensus to OpenTelemetry #9043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The following is the coverage report on the affected files.
|
bab5a44 to
a2ea129
Compare
|
The following is the coverage report on the affected files.
|
|
The following is the coverage report on the affected files.
|
|
/retest |
|
The following is the coverage report on the affected files.
|
|
/retest |
|
Re knative bump, we discussed multiple times that we want knative/pkg@04fdd0b included in the next bump, even though it only exists in main for now (not even in 1.20). It'll unlock the better finalizer management which proved to be an issue for Tekton deployments with multiple controllers managing the same PR/TR resources. So do we want to bump knative/pkg even higher? |
|
/assign @vdemeester @waveywaves |
|
@enarha I have updated to latest knative/pkg for now. |
|
Why are ci tests being skipped? |
|
/kind feature |
@anithapriyanatarajan Can you recheck this? |
anithapriyanatarajan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some tests in test/ folder are still using opencensus. Could you assess that and remove opencensus from go.mod.
| } | ||
|
|
||
| var statsReporterOptions []webhook.StatsReporterOption | ||
| enableNamespace := os.Getenv("WEBHOOK_METRICS_ENABLE_NAMESPACE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khrm - Is this a breaking change for users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following knative metrics - reconcile_count and reconcile_latency are missing . Could be that knative has deprecated those or are there other equivalents?
tekton_pipelines_controller_reconcile_count (counter) - Number of reconcile operations | reconciler|success|namespace_name
tekton_pipelines_controller_reconcile_latency | histogram | Latency of reconcile operations
| github.com/spiffe/go-spiffe/v2 v2.6.0 | ||
| github.com/spiffe/spire-api-sdk v1.14.0 | ||
| github.com/tektoncd/plumbing v0.0.0-20220817140952-3da8ce01aeeb | ||
| go.opencensus.io v0.24.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khrm Why do we still require opencensus?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is being used for tracing.
Yes, there's an equivalent - kn.workqueue.work_duration_count must be there. |
aThorp96
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still reviewing but I've left some notes and questions
| app.kubernetes.io/instance: default | ||
| app.kubernetes.io/part-of: tekton-pipelines | ||
| data: | ||
| metrics-protocol: prometheus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentionally included in the configmap since it's already the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's intentionally introduced in configmap. Default is none which causes no metrics to be emitted but earlier were emitting prometheus metrics by default.
| # OpenTelemetry Metrics Configuration | ||
| # Protocol for metrics export (prometheus, grpc, http/protobuf, none) | ||
| # Default: prometheus | ||
| metrics-protocol: prometheus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new settings all use metrics-* or tracing-* whereas settings previously were metrics.*. Is there any reason for the naming convention change? The unchanged settings are still metrics.* so if possible we should keep the naming consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these changes are due to OTEL migration.
Other metrics configurations are related to Tekton, so I am not sure whether there's any benefit to making breaking changes. Users don't need to make any changes if they were already using Prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these changes are due to OTEL migration.
I assume that means OTEL consumes these directly so we can't control their names, which makes sense. Would you be opposed to linking to the OTEL docs in the comments of this section of settings? That should help with both documentation and delineating what settings we "own" and what are inherited.
Other metrics configurations are related to Tekton, so I am not sure whether there's any benefit to making breaking changes.
I agree, makes sense.
Users don't need to make any changes if they were already using Prometheus.
Makes sense. That does make me realize that we currently document stackdriver support. I've never used stackdriver and am not familiar with it, but for users currently using stackdriver what does their migration path look like?
This upgrades knative to latest and other dependent dependencies.
Replaces the legacy knative.dev/pkg/metrics dependency with knative.dev/pkg/observability/configmap to align with the OpenTelemetry migration and ensure correct configuration loading.
This isn't supported after knative otel's migration.
The knativetest.Flags.EmitMetrics flag is no longer supported after the migration to OpenTelemetry. This commit removes the conditional initialisation to align with the new metrics system.
Run hack/update-codegen.sh to update generated client code, deepcopy functions, and CRD manifests. This aligns the generated code with recent dependency updates, including changes to finalizer management and context usage in informers.
Update the expected checksums in Pipeline and Task unit tests to match the new values generated after the knative update: - pkg/apis/pipeline/v1 - pkg/apis/pipeline/v1beta1
492fef8 to
bf58998
Compare
This can be covered in separate PR. It's related to tracing. We need to migrate to knative based otel tracing. |
config/config-observability.yaml
Outdated
| metrics.allow-stackdriver-custom-metrics: "false" | ||
| # OpenTelemetry Metrics Configuration | ||
| # Protocol for metrics export (prometheus, grpc, http/protobuf, none) | ||
| # Default: prometheus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have to specify "prometheus" to make it "default" in our deployment, I think this comment needs to be clarified:
| # Default: prometheus | |
| # Default if not specified: "none" |
Also: When you say (prometheus, grpc, http/protobuf, none), are the supported values ["prometheus", "grpc", "http/protobuf", "none"] or are the supported values ["prometheus", "grpc", "http/protobuf", ""] ? If "none" is "", then my above suggestion would become
| # Default: prometheus | |
| # Default if not specified: "" (none) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's "none".
This commit migrates the metrics for PipelineRuns and TaskRuns from
OpenCensus to OpenTelemetry.
The following changes are included:
- Updated the observability config to support OpenTelemetry.
- Migrated the implementation of PipelineRun and TaskRun metrics to use
the OpenTelemetry Go SDK.
- Updated the tests to work with the new OpenTelemetry-based
implementation.
vdemeester
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this major migration effort! I've reviewed the metrics implementation and OpenCensus cleanup. The OpenTelemetry migration looks solid overall, but there are some remaining OpenCensus references that need to be addressed before merge.
Summary
Metrics Implementation ✅
The core metrics migration to OpenTelemetry is well-executed:
- ✅ Proper use of OTel instruments (Histograms, Counters, ObservableGauges)
- ✅ Backward-compatible metric names preserved
- ✅ Good thread safety with mutex usage
- ✅ Observable gauge callbacks implemented correctly
OpenCensus Cleanup ❌
Blocker: OpenCensus is still present in test files and go.mod
Critical Question ⚠️
@anithapriyanatarajan raised a concern on Jan 6 about throttled metrics accumulating instead of showing current values. Looking at the current code (lines 346-380 in pkg/taskrunmetrics/metrics.go), the implementation creates fresh maps on each observation, which should be correct. Has this issue been resolved? If so, it would be helpful to document what was fixed.
Required Changes
1. Remove OpenCensus from test files (Blocker)
The following test files still import and use OpenCensus for tracing:
test/custom_task_test.go line 37:
import "go.opencensus.io/trace"Used on line 257 and throughout for test span tracking.
test/wait.go line 6:
Same import, used in lines 90, 117, 135, 153, 179, 209 for various wait function spans.
Action Required:
Migrate to OpenTelemetry tracing API:
// OLD (OpenCensus):
import "go.opencensus.io/trace"
_, span := trace.StartSpan(ctx, metricName)
defer span.End()
// NEW (OpenTelemetry):
import "go.opentelemetry.io/otel"
tracer := otel.Tracer("tekton-e2e-tests")
ctx, span := tracer.Start(ctx, metricName)
defer span.End()2. Clean up go.mod
After migrating the test files, remove the OpenCensus dependency:
go.opencensus.io v0.24.0
Steps:
- Migrate
test/custom_task_test.goandtest/wait.goto OTel - Run
go mod tidy - Run
./hack/update-deps.shto update vendor/ - Verify
go.opencensus.iois completely removed
Note: If it comes back as an indirect dependency, we'll need to track down which dependency is still pulling it in.
3. Throttled Metrics Question
In pkg/taskrunmetrics/metrics.go lines 346-380, the throttled metrics implementation looks correct to me:
trsThrottledByQuotaandtrsThrottledByNodeare local maps created fresh on each callback- They are populated by iterating current TaskRuns
- They are reported once per observation
This should give the current count, not accumulate. Can you confirm:
- Was the accumulation bug fixed in a recent commit?
- Is the current behavior correct now?
Overall Assessment
The metrics migration is well-implemented. Once the OpenCensus cleanup is complete and the throttled metrics question is clarified, this should be good to merge.
vdemeester
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding inline comments on the specific files modified in this PR (following up on my previous review).
| "github.com/tektoncd/pipeline/pkg/apis/pipeline" | ||
| pipeline "github.com/tektoncd/pipeline/pkg/apis/pipeline" | ||
| v1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1" | ||
| listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ Great OTel Migration
The migration from OpenCensus to OpenTelemetry looks excellent here:
Before (OpenCensus):
"go.opencensus.io/stats"
"go.opencensus.io/stats/view"
"go.opencensus.io/tag"After (OpenTelemetry):
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/metric"This is the right approach.
| flag.Parse() | ||
| flag.Set("alsologtostderr", "true") | ||
| logging.InitializeLogger() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenCensus Metric Exporter Removed
Good - removed the old OpenCensus metric exporter initialization.
However, test/custom_task_test.go and test/wait.go still import OpenCensus for tracing (not metrics):
import "go.opencensus.io/trace"These files are NOT modified in this PR but still use OpenCensus. They need to be migrated to OpenTelemetry tracing in a follow-up or this PR:
// Replace with:
import "go.opentelemetry.io/otel"
tracer := otel.Tracer("tekton-e2e-tests")
ctx, span := tracer.Start(ctx, metricName)
defer span.End()| github.com/pkg/errors v0.9.1 | ||
| github.com/sigstore/sigstore v1.9.5 | ||
| github.com/spiffe/go-spiffe/v2 v2.6.0 | ||
| github.com/spiffe/spire-api-sdk v1.14.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OpenCensus Dependency Still Present
After the OpenTelemetry migration, this should be removed:
go.opencensus.io v0.24.0
Why it's still here:
The test files test/custom_task_test.go and test/wait.go still use OpenCensus for tracing (see my comment on test/util.go).
To remove:
- Migrate those test files to OTel tracing
- Run
go mod tidy - Run
./hack/update-deps.shto update vendor/ - Verify
go.opencensus.iois gone (or only present as indirect dependency)
Changes
pipelinerunmetricsandtaskrunmetricsto use OpenTelemetry instruments (histograms, counters, gauges) for creating and recording metrics.Introduced new OpenTelemetry configurations in
config/config-observability.yamlfor exporters and protocols..Rewrote the test suites for
pipelinerunmetricsandtaskrunmetricsto be compatible with the new OpenTelemetry-based implementation.fixes #8969
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
/kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes
/kind feature