feat: initial work to set-up opentelemetry tracing #1378
Conversation
@edgurgel i would suggest making the
```diff
@@ -232,6 +237,7 @@ defmodule RealtimeWeb.TenantController do
   def reload(conn, %{"tenant_id" => tenant_id}) do
     Logger.metadata(external_id: tenant_id, project: tenant_id)
+    Tracer.set_attributes(external_id: tenant_id)
```
you can do this all in a plug in the router btw to cover all requests
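Something like this in the router, presumably (a sketch only; `set_tenant_span_attributes` is an illustrative name, and as discussed below the phoenix span may not exist yet at this stage):

```elixir
# Hypothetical router plug; note opentelemetry_phoenix only starts its span
# after router dispatch, so there may be no active span context here.
require OpenTelemetry.Tracer, as: Tracer

pipeline :api do
  plug :accepts, ["json"]
  plug :set_tenant_span_attributes
end

defp set_tenant_span_attributes(%Plug.Conn{path_params: %{"tenant_id" => tenant_id}} = conn, _opts) do
  Tracer.set_attributes(external_id: tenant_id)
  conn
end

defp set_tenant_span_attributes(conn, _opts), do: conn
```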
Sadly I don't think we can if we want to hook on the opentelemetry_phoenix span. opentelemetry_phoenix creates the span after the router dispatches, so there is no span at the router stage to add an attribute to.
At least I tried and it always showed that no span context exists. Unless we create a span before opentelemetry_phoenix does.
I will give it a try again though, now that I have the integration tests to quickly test this out.
huh I was probably holding it wrong... I was adding it on the "AssignTenant" plug and it was always showing no span context? 🤔 Well 🤷 Will update there and remove all the controller changes!
Thanks for bringing it up!
you can place it in a plug, then use the phoenix controller macro to set it in every controller via the `use RealtimeWeb ...`; that would cover every route quite easily. And we can probably move the Logger.metadata enrichment there too, so that only biz logic is handled in controllers.
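If I follow, the wiring would look roughly like this (a sketch assuming the standard `RealtimeWeb` macro layout; `RealtimeWeb.Plugs.ObservabilityAttributes` is an illustrative name):

```elixir
# lib/realtime_web.ex — hypothetical sketch of wiring a shared plug through
# the controller macro so every controller picks it up
def controller do
  quote do
    use Phoenix.Controller, namespace: RealtimeWeb

    # illustrative module plug that would set span attributes and
    # Logger.metadata from the conn before any action runs
    plug RealtimeWeb.Plugs.ObservabilityAttributes
  end
end
```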
Yeah, but this specific one (AssignTenant) does it based on the host (`https://$project_id.supabase.com` tenant host) while that other one uses the `path_params` (admin). So not a single solution across all endpoints. 🤷
Yup, makes sense! I will make the
Use assert_receive as it might take some time for the otel_simple_processor to send the spans across
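A hedged sketch of such an assertion, assuming the test routes finished spans to itself via `:otel_simple_processor.set_exporter/2` (the module and span names are illustrative):

```elixir
defmodule Realtime.Tenants.ConnectTracingTest do
  use ExUnit.Case, async: false

  require Record

  # Pull in the :span record definition shipped with the OpenTelemetry SDK
  @span_fields Record.extract(:span, from_lib: "opentelemetry/include/otel_span.hrl")
  Record.defrecordp(:span, @span_fields)

  test "connect emits a span" do
    # Deliver every finished span to this test process as {:span, record}
    :otel_simple_processor.set_exporter(:otel_exporter_pid, self())

    # ... exercise the traced code path here ...

    # assert_receive, not assert: the processor exports asynchronously,
    # so the span may take a moment to arrive
    assert_receive {:span, span(name: "database.connect")}, 5_000
  end
end
```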
* Set otel attributes using AssignTenant where possible
* Set otel attributes using a controller plug for TenantController

Also add unit tests!
```diff
@@ -26,6 +25,8 @@ defmodule RealtimeWeb.TenantController do
   action_fallback(RealtimeWeb.FallbackController)

+  plug :set_observability_attributes when action in [:show, :edit, :update, :delete, :reload, :health]
```
@Ziinc I ended up using a plug here as we don't need it for all actions. I thought about moving this to the router but it would essentially need to look for `tenant_id` as path_params on all the routes, which I don't know if that's what we want 🤷
This felt like a good compromise?
the plug can handle the situation where there isn't a tenant_id in the path params with a fallthrough, so you don't have to explicitly state the actions; that would make it less verbose
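i.e. something like this (a sketch using the same names as the diff above):

```elixir
# Fallthrough version: no `when action in [...]` needed — the second clause
# simply passes the conn through when there is no tenant_id path param
defp set_observability_attributes(%Plug.Conn{path_params: %{"tenant_id" => tenant_id}} = conn, _opts) do
  Tracer.set_attributes(external_id: tenant_id)
  Logger.metadata(external_id: tenant_id, project: tenant_id)
  conn
end

defp set_observability_attributes(conn, _opts), do: conn
```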
lgtm
REQUEST_ID_BAGGAGE_KEY is the only one that is custom to Realtime
Avoid this:

```
1) test distributed lookup_or_start_connection/1 trace distributed connect (Realtime.Tenants.ConnectTest)
   Error: test/realtime/tenants/connect_test.exs:36
   ** (ErlangError) Erlang error: {:exception, {:badfun, #Function<2.131929709/0 in Clustered.start/2>}, [{:error_handler, :undefined_lambda, 3, [file: ~c"error_handler.erl", line: 120]}, {:erlang, :apply, 2, []}]}
   code: {:ok, node} = Clustered.start(
```
@edgurgel remember to do a test against a logflare dev source prior to merge 🙇
@Ziinc Just tested it out! It seems to be collecting them nicely as long as they have an instrumentation scope set. [screenshot] Parent and child span:
```json
{
"attributes": {
"_client_address": "192.168.107.1",
"_http_request_method": "POST",
"_http_response_status_code": 202,
"_http_route": "/api/broadcast",
"_network_peer_address": "127.0.0.1",
"_network_peer_port": 50801,
"_network_protocol_version": "1.1",
"_phoenix_action": "broadcast",
"_phoenix_plug": "Elixir.RealtimeWeb.BroadcastController",
"_server_address": "localhost",
"_url_path": "/api/broadcast",
"_url_scheme": "http",
"_user_agent_original": "curl/8.7.1",
"external_id": "realtime-dev",
"request_id": "GEG9kXJJhoNxqusAAATh"
},
"end_time": "2025-05-22T04:02:42.939720Z",
"event_message": "POST /api/broadcast",
"id": "117a008c-3aea-48ac-9b6d-7301a7e861cf",
"metadata": {
"name": "opentelemetry_cowboy",
"process": {
"executable": {
"name": "erl"
},
"runtime": {
"description": "Erlang/OTP 27 erts-15.1",
"name": "BEAM",
"version": "15.1"
}
},
"service": {
"instance": {
"id": "[email protected]"
},
"name": "realtime"
},
"telemetry": {
"sdk": {
"language": "erlang",
"name": "opentelemetry",
"version": "1.5.0"
}
},
"type": "span",
"version": "1.0.0"
},
"parent_span_id": "B9FCAA078B7B3D08",
"project": "opentelemetry_cowboy",
"span_id": "B7B0BA5630319B44",
"start_time": "2025-05-22T04:02:42.767301Z",
"timestamp": 1747886562767301,
"trace_id": "07c46ae6-ee3c-0fda-496e-8bcf7f6b1356"
}
```
some readability concerns but overall LGTM 👍
lib/realtime/rpc.ex
Outdated
```elixir
defp erpc_call(node, mod, func, args, timeout, _opts) when node == node() do
  :erpc.call(node, mod, func, args, timeout)
end
```
why exclude local calls from telemetry?
Because it will be a normal local call which can be traced normally by the existing spans, like here:
This could potentially be handled "transparently" by our RPC module, but I'm not sure if it will fit all of our use cases. Let me try something and get back to you :brb:
This: cb4dce8
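As a rough sketch of the "transparent" idea (names are illustrative, not necessarily what cb4dce8 does), the RPC wrapper can open the span itself so callers never touch tracing:

```elixir
defmodule Realtime.RPC do
  @moduledoc "Hypothetical sketch of tracing handled inside the RPC module"

  require OpenTelemetry.Tracer, as: Tracer

  # Wrap every call in a span; the real change also hands the otel context
  # over to the spawned/remote process, which is elided in this sketch
  def enhanced_call(node, mod, func, args, timeout \\ 15_000) do
    Tracer.with_span "rpc.call", %{
      attributes: %{node: inspect(node), mod: inspect(mod), func: func}
    } do
      :erpc.call(node, mod, func, args, timeout)
    end
  end
end
```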
lib/realtime/tenants/connect.ex
Outdated
```elixir
opts =
  if node == node() do
    # Tracing context is passed for local calls
    # This is necessary because :erpc will spawn a process when timeout is not :infinity
    span_ctx = Tracer.start_span("database.connect")
    ctx = OpenTelemetry.Ctx.get_current()
    [otel: %{span_ctx: span_ctx, ctx: ctx}] ++ opts
  else
    # For actual remote calls we pass whatever context propagation we have available
    case :otel_propagator_text_map.inject([]) do
      [] -> []
      otel_propagation -> [otel: otel_propagation]
    end
  end
```
are we able to move this outside of the Connect code somehow? Connect is already quite complex and this adds quite a lot of readability load to this code
Now the whole tracing set-up is done on RPC itself: cb4dce8
lib/realtime/tenants/connect.ex
Outdated
```elixir
defp process_otel(propagation) do
  :otel_propagator_text_map.extract(propagation)
  process_otel([])
end
```
are we able to move this outside of the Connect code somehow? Connect is already quite complex and this adds quite a lot of readability load to this code
It handles both remote and local calls
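Roughly, mirroring the inject side shown earlier, a both-cases handler could look like this (a sketch under those assumptions, not the actual code):

```elixir
# Local call: :erpc spawned a fresh process, so re-attach the caller's
# context and make its span the current one
defp process_otel(%{span_ctx: span_ctx, ctx: ctx}) do
  OpenTelemetry.Ctx.attach(ctx)
  Tracer.set_current_span(span_ctx)
end

# Remote call: restore the trace context from the text-map propagation
defp process_otel(propagation) when is_list(propagation) do
  :otel_propagator_text_map.extract(propagation)
end
```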
Due to the multiple requests sent during tests, we can't guarantee how many active connections will be up. We can only ensure that it's greater than 0.
next time let's try to make smaller changes. for example:
- create the initial setup of opentelemetry
- enhance RPC module with traces
- enhance Sockets with traces
- abstract to Plugs
- enhance Connect with traces
- etc...
with traces we can make them in an iterative way and it's ok to start smaller and add more spans as we go so we avoid big PRs
What kind of change does this PR introduce?
Introduces opentelemetry tracing using the great elixir/erlang libraries.
This PR:

- traces `Database.connect` (remote and local);
- traces `UserSocket.connect`, but only for the step that it connects, not during the whole websocket duration, as the websocket connection can exist for a really long time;
- propagates `traceparent` and `baggage`;
- adds `request_id` to logger metadata.

And finally a new Plug to grab a request ID from the `baggage` if available. Otherwise it uses the original `Plug.RequestId` (see the sketch below).
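A hedged sketch of that request-id plug (assuming the baggage key is `sb-request-id` and that `:otel_baggage.get_all/0` returns a map of key to `{value, metadata}`; the module name is illustrative):

```elixir
defmodule RealtimeWeb.Plugs.BaggageRequestId do
  @moduledoc "Sketch: prefer a request id from OTel baggage, else fall back to Plug.RequestId"
  @behaviour Plug

  require Logger

  @impl true
  def init(opts), do: Plug.RequestId.init(opts)

  @impl true
  def call(conn, request_id_opts) do
    case :otel_baggage.get_all() do
      %{"sb-request-id" => {request_id, _metadata}} ->
        Logger.metadata(request_id: request_id)
        Plug.Conn.put_resp_header(conn, "x-request-id", request_id)

      _ ->
        # No baggage value: stock Plug.RequestId behaviour
        Plug.RequestId.call(conn, request_id_opts)
    end
  end
end
```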
The way it's configured, nothing that is traced will actually be sent because there is no exporter. The libraries we use respect most of the OTEL environment variables, so it's possible to define an exporter like this (Jaeger locally):
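As one assumption-laden example, the standard OTLP setup the Erlang SDK honours would be `OTEL_TRACES_EXPORTER=otlp` plus `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`, or equivalently in application config:

```elixir
# config/runtime.exs — hedged sketch, pointing the OTLP exporter at the
# Jaeger container from the compose file further down
import Config

config :opentelemetry, traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "http://localhost:4318"
```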
There are tests covering database connection, both local and remote, and a basic integration test hitting Phoenix as a black box so that the trace for the HTTP request is captured.
TODO:

- `sb-request-id` (from baggage) as attribute for spans

Out-of-scope for this PR:

- `channel.join` etc. Next PR can cover these.

What is the current behavior?
No tracing.
What is the new behavior?
Possibility to receive tracing.
Additional context
Used Jaeger locally to visualise some of the traces:
❯ cat docker-compose.tracing.yml

```yaml
services:
  jaeger:
    image: 'jaegertracing/jaeger:2.6.0'
    ports:
      - '5778:5778'
      - '4317:4317'
      - '4318:4318'
      - '16686:16686'
      - '9411:9411'
```