feat: initial work to set-up opentelemetry tracing #1378


Open · wants to merge 13 commits into main from basic-opentelemetry

Conversation

Member

@edgurgel edgurgel commented May 20, 2025

What kind of change does this PR introduce?

Introduces OpenTelemetry tracing using the excellent Elixir/Erlang OpenTelemetry libraries.

This PR:

  • Sets up tracing for Phoenix, Cowboy and Ecto.
  • Adds a trace span for Database.connect (remote and local).
  • Adds a trace span for UserSocket.connect, covering only the connection step rather than the whole websocket lifetime, since a websocket connection can stay open for a very long time.
  • Propagates traceparent and baggage.
  • Adds request_id to the Logger metadata.

Finally, it adds a new Plug that takes the request ID from the baggage when available; otherwise it falls back to the original Plug.RequestId behaviour.
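A minimal sketch of what such a plug could look like. The module name, the baggage key, and the exact return shape of `OpenTelemetry.Baggage.get_all/0` are assumptions here; the actual implementation in the PR may differ:

```elixir
defmodule RealtimeWeb.Plugs.BaggageRequestId do
  # Hypothetical sketch: prefer a request id propagated via OTel baggage,
  # otherwise fall back to the behaviour of Plug.RequestId.
  @behaviour Plug

  require Logger

  # Assumed baggage key name
  @baggage_key "sb-request-id"

  @impl true
  def init(opts), do: opts

  @impl true
  def call(conn, _opts) do
    # get_all/0 is assumed to return %{key => {value, metadata}}
    case OpenTelemetry.Baggage.get_all() do
      %{@baggage_key => {request_id, _metadata}} ->
        Logger.metadata(request_id: request_id)
        Plug.Conn.put_resp_header(conn, "x-request-id", request_id)

      _ ->
        # No baggage entry: defer to the stock Plug.RequestId plug
        Plug.RequestId.call(conn, Plug.RequestId.init([]))
    end
  end
end
```

In a Phoenix endpoint this would replace the usual `plug Plug.RequestId` line, so the baggage value wins whenever a caller propagated one.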

As configured, nothing that is traced is actually sent anywhere, because no exporter is defined. The libraries we use respect most of the OTEL environment variables, so it's possible to define an exporter like this (Jaeger running locally):

```
OTEL_TRACES_EXPORTER=otlp OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 iex -S mix phx.server
```

There are tests covering local and remote database connections, plus a basic integration test that hits Phoenix as a black box so that the trace for the HTTP request is captured.
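The black-box style of test described above can be sketched roughly as follows. The module name, endpoint URL, HTTP client (`Req`), and span name are assumptions; only the `:otel_simple_processor.set_exporter/2` + `assert_receive` pattern is the documented way to capture spans in tests with the Erlang OpenTelemetry SDK:

```elixir
defmodule RealtimeWeb.TracingIntegrationTest do
  use ExUnit.Case, async: false

  # The :span record is defined in the opentelemetry SDK's header file
  require Record
  @span_fields Record.extract(:span, from_lib: "opentelemetry/include/otel_span.hrl")
  Record.defrecordp(:span, @span_fields)

  test "the HTTP request produces a span" do
    # Route finished spans to the test process instead of a real exporter
    :otel_simple_processor.set_exporter(:otel_exporter_pid, self())

    # Hit Phoenix as a black box (URL and payload are placeholders)
    Req.post!("http://localhost:4000/api/broadcast", json: %{})

    # Spans are delivered asynchronously, hence assert_receive with a timeout
    assert_receive {:span, span(name: "POST /api/broadcast")}, 5_000
  end
end
```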

TODO:

  • Add a README section about OTEL once finalised.
  • Attach the sb-request-id (from baggage) as an attribute on spans.

Out-of-scope for this PR:

  • More traces (channel.join etc.); a follow-up PR can cover these.
  • Trace sampling per tenant, etc.

What is the current behavior?

No tracing.

What is the new behavior?

Tracing can now be enabled by configuring an OTEL exporter.

Additional context

Used Jaeger locally to visualise some of the traces:

❯ cat docker-compose.tracing.yml

```yaml
services:
  jaeger:
    image: 'jaegertracing/jaeger:2.6.0'
    ports:
      - '5778:5778'
      - '4317:4317'
      - '4318:4318'
      - '16686:16686'
      - '9411:9411'
```
[Screenshot: Jaeger UI showing the POST /api/broadcast trace]

vercel bot commented May 20, 2025

1 Skipped Deployment: realtime-demo — Ignored, updated May 26, 2025 9:46pm (UTC)

@Ziinc
Contributor

Ziinc commented May 20, 2025

@edgurgel I would suggest making the sb-request-id propagation to logs metadata configuration-based via env var or otherwise, so that we don't couple Supabase-specific stuff with Realtime

```diff
@@ -232,6 +237,7 @@ defmodule RealtimeWeb.TenantController do

   def reload(conn, %{"tenant_id" => tenant_id}) do
     Logger.metadata(external_id: tenant_id, project: tenant_id)
+    Tracer.set_attributes(external_id: tenant_id)
```
Contributor

@Ziinc Ziinc May 20, 2025

you can do this all in a plug in the router btw to cover all requests

Member Author

Sadly I don't think we can if we want to hook on the opentelemetry_phoenix span. opentelemetry_phoenix creates the span after the router dispatches, so there is no span at the router stage to add an attribute to.

At least I tried and it always showed that no span context exists. Unless we create a span before opentelemetry_phoenix does.

https://github.com/open-telemetry/opentelemetry-erlang-contrib/blob/70f561d4f92f9233ade373351d0e97287588eb30/instrumentation/opentelemetry_phoenix/lib/opentelemetry_phoenix.ex#L151-L161

Member Author

I will give it a try again though. Now that I have the integration tests to quickly test this out.

Member Author

huh I was probably holding it wrong... I was adding it on the "AssignTenant" plug and it was always showing no span context? 🤔 Well 🤷 Will update there and remove all the controller changes!

Thanks for bringing it up!

Contributor

you can place it in a plug then use the phoenix controller macro to set it in every controller in the `use RealtimeWeb` block. That would cover every route quite easily, and you could probably move the Logger.metadata enrichment there too so that only business logic is handled in controllers.

Member Author

@edgurgel edgurgel May 22, 2025

Yeah, but this specific one (AssignTenant) does it based on the host (the https://$project_id.supabase.com tenant host) while the other one uses the `path_params` (admin). So there isn't a single solution across all endpoints. 🤷

@edgurgel
Member Author

> @edgurgel i would suggest making the sb-request-id propagation to logs metadata configuration-based via env var or otherwise so that we don't couple supabase specific stuff with realtime

Yup, makes sense! I will make the sb-request-id bit part of the config/env 👌

edgurgel added 2 commits May 21, 2025 13:08

  • Use assert_receive as it might take some time for the otel_simple_processor to send the spans across
  • Set otel attributes using AssignTenant where possible and using a controller plug for TenantController; also add unit tests!
```diff
@@ -26,6 +25,8 @@ defmodule RealtimeWeb.TenantController do

   action_fallback(RealtimeWeb.FallbackController)

+  plug :set_observability_attributes when action in [:show, :edit, :update, :delete, :reload, :health]
```
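For reference, a hedged sketch of what such a controller plug function could look like. The exact implementation in the PR may differ; this assumes `require OpenTelemetry.Tracer, as: Tracer` as in the earlier controller diff:

```elixir
# Hypothetical sketch: set observability attributes from the tenant id
# found in the path params, passing the conn through unchanged otherwise.
defp set_observability_attributes(%Plug.Conn{path_params: %{"tenant_id" => tenant_id}} = conn, _opts) do
  Logger.metadata(external_id: tenant_id, project: tenant_id)
  Tracer.set_attributes(external_id: tenant_id)
  conn
end

defp set_observability_attributes(conn, _opts), do: conn
```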
Member Author

@edgurgel edgurgel May 21, 2025

@Ziinc I ended up using a plug here as we don't need it for all actions. I thought about moving this to the router, but it would essentially need to look for tenant_id in path_params on all the routes, which I'm not sure is what we want 🤷

This felt like a good compromise?

Contributor

the plug can handle the situation where there isn't a tenant_id in the path params with a fallthrough, so you don't have to explicitly state the actions; that would make it less verbose.
lgtm

@coveralls

coveralls commented May 21, 2025

Coverage Status

coverage: 82.591% (+0.3%) from 82.314% when pulling e70ca8e on basic-opentelemetry into e27b587 on main.

edgurgel added 3 commits May 21, 2025 15:30

  • REQUEST_ID_BAGGAGE_KEY is the only one that is custom to Realtime
  • Avoid this:
```
  1) test distributed lookup_or_start_connection/1 trace distributed connect (Realtime.Tenants.ConnectTest)
Error:      test/realtime/tenants/connect_test.exs:36
     ** (ErlangError) Erlang error: {:exception, {:badfun, #Function<2.131929709/0 in Clustered.start/2>}, [{:error_handler, :undefined_lambda, 3, [file: ~c"error_handler.erl", line: 120]}, {:erlang, :apply, 2, []}]}
     code: {:ok, node} = Clustered.start(
```
@edgurgel edgurgel marked this pull request as ready for review May 21, 2025 09:42
@edgurgel edgurgel requested review from Ziinc and filipecabaco May 21, 2025 09:42
@Ziinc
Contributor

Ziinc commented May 21, 2025

@edgurgel remember to do a test against a logflare dev source prior to merge 🙇

@edgurgel
Member Author

@Ziinc Just tested it out! It seems to be collecting them nicely as long as they have an instrumentation scope set.

[Screenshot: Logflare showing the collected traces]
Parent and child span:

```json
{
  "attributes": {
    "_client_address": "192.168.107.1",
    "_http_request_method": "POST",
    "_http_response_status_code": 202,
    "_http_route": "/api/broadcast",
    "_network_peer_address": "127.0.0.1",
    "_network_peer_port": 50801,
    "_network_protocol_version": "1.1",
    "_phoenix_action": "broadcast",
    "_phoenix_plug": "Elixir.RealtimeWeb.BroadcastController",
    "_server_address": "localhost",
    "_url_path": "/api/broadcast",
    "_url_scheme": "http",
    "_user_agent_original": "curl/8.7.1",
    "external_id": "realtime-dev",
    "request_id": "GEG9kXJJhoNxqusAAATh"
  },
  "end_time": "2025-05-22T04:02:42.939720Z",
  "event_message": "POST /api/broadcast",
  "id": "117a008c-3aea-48ac-9b6d-7301a7e861cf",
  "metadata": {
    "name": "opentelemetry_cowboy",
    "process": {
      "executable": { "name": "erl" },
      "runtime": {
        "description": "Erlang/OTP 27 erts-15.1",
        "name": "BEAM",
        "version": "15.1"
      }
    },
    "service": {
      "instance": { "id": "[email protected]" },
      "name": "realtime"
    },
    "telemetry": {
      "sdk": {
        "language": "erlang",
        "name": "opentelemetry",
        "version": "1.5.0"
      }
    },
    "type": "span",
    "version": "1.0.0"
  },
  "parent_span_id": "B9FCAA078B7B3D08",
  "project": "opentelemetry_cowboy",
  "span_id": "B7B0BA5630319B44",
  "start_time": "2025-05-22T04:02:42.767301Z",
  "timestamp": 1747886562767301,
  "trace_id": "07c46ae6-ee3c-0fda-496e-8bcf7f6b1356"
}
```

```json
{
  "attributes": {
    "external_id": "realtime-dev"
  },
  "end_time": "2025-05-22T04:02:42.937639Z",
  "event_message": "database.connect",
  "id": "b0f8f1cf-7209-464c-a90e-a177026f9923",
  "metadata": {
    "name": "realtime",
    "process": {
      "executable": { "name": "erl" },
      "runtime": {
        "description": "Erlang/OTP 27 erts-15.1",
        "name": "BEAM",
        "version": "15.1"
      }
    },
    "service": {
      "instance": { "id": "[email protected]" },
      "name": "realtime"
    },
    "telemetry": {
      "sdk": {
        "language": "erlang",
        "name": "opentelemetry",
        "version": "1.5.0"
      }
    },
    "type": "span",
    "version": "2.55.4"
  },
  "parent_span_id": "B7B0BA5630319B44",
  "project": "realtime",
  "span_id": "EDEAF2AFB92F0E24",
  "start_time": "2025-05-22T04:02:42.917176Z",
  "timestamp": 1747886562917176,
  "trace_id": "07c46ae6-ee3c-0fda-496e-8bcf7f6b1356"
}
```

Member

@filipecabaco filipecabaco left a comment

some readability concerns but overall LGTM 👍

Comment on lines 81 to 83

```elixir
defp erpc_call(node, mod, func, args, timeout, _opts) when node == node() do
  :erpc.call(node, mod, func, args, timeout)
end
```
Member

why exclude local calls from telemetry?

Member Author

Because it will be a normal local call, which can be traced normally by the existing spans like here:

https://github.com/supabase/realtime/blob/basic-opentelemetry/lib/realtime/tenants/connect.ex#L402-L408

This could potentially be handled "transparently" by our RPC module but I'm not sure if it will fit all of our use cases. Let me try something and get back to you :brb:

Member Author

This: cb4dce8

Comment on lines 401 to 414

```elixir
opts =
  if node == node() do
    # Tracing context is passed for local calls
    # This is necessary because :erpc will spawn a process when timeout is not :infinity
    span_ctx = Tracer.start_span("database.connect")
    ctx = OpenTelemetry.Ctx.get_current()
    [otel: %{span_ctx: span_ctx, ctx: ctx}] ++ opts
  else
    # For actual remote calls we pass whatever context propagation we have available
    case :otel_propagator_text_map.inject([]) do
      [] -> []
      otel_propagation -> [otel: otel_propagation]
    end
  end
```
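For context, the callee side of this handshake would then need to re-attach whatever arrived in `opts[:otel]` before doing any traced work. A hedged sketch, where the clause shapes are assumptions derived from the snippet above rather than the PR's actual code:

```elixir
# Hypothetical receiving side: restore the propagated context so spans
# created in the spawned :erpc process join the caller's trace.
case Keyword.get(opts, :otel) do
  %{span_ctx: span_ctx, ctx: ctx} ->
    # Local call: attach the caller's context and make its span current
    OpenTelemetry.Ctx.attach(ctx)
    Tracer.set_current_span(span_ctx)

  otel_propagation when is_list(otel_propagation) ->
    # Remote call: extract traceparent/baggage headers into the current context
    :otel_propagator_text_map.extract(otel_propagation)

  nil ->
    # No tracing context was passed along
    :ok
end
```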
Member

are we able to move this outside of the Connect code somehow? Connect is already quite complex, and this adds quite a lot of readability load to this code

Member Author

Now the whole tracing set-up is done on RPC itself: cb4dce8

```elixir
defp process_otel(propagation) do
  :otel_propagator_text_map.extract(propagation)
  process_otel([])
end
```
Member

are we able to move this outside of the Connect code somehow? Connect is already quite complex, and this adds quite a lot of readability load to this code

edgurgel added 4 commits May 26, 2025 17:05

  • It handles both remote and local calls
  • Due to the multiple requests sent during tests we can't guarantee how many active connections will be up; we can only ensure that it's greater than 0
@edgurgel edgurgel requested a review from filipecabaco May 27, 2025 09:26
Member

@filipecabaco filipecabaco left a comment

next time let's try to make smaller changes, for example:

  • create the initial setup of opentelemetry
  • enhance the RPC module with traces
  • enhance Sockets with traces
  • abstract to Plugs
  • enhance Connect with traces
  • etc...

with traces we can build them up in an iterative way, and it's OK to start smaller and add more spans as we go so we avoid big PRs
