ruby: add DTV-based TLS access for ruby_current_ec#1226
ruby: add DTV-based TLS access for ruby_current_ec#1226christos68k merged 9 commits intoopen-telemetry:mainfrom
Conversation
|
flagging that #1229 also uses DTV stuff, we need to verify that these two PRs don't conflict, and are aligned in their approach for DTV access. |
florianl
left a comment
There was a problem hiding this comment.
just minor style comments. maybe we could add a dtv check helper function and use it in rubys Loader() and VisitTLSRelocations().
18d5d03 to
1b19af1
Compare
@florianl thanks for the first pass, addressed your nits I believe, but I need clarification on
Did you mean something along these lines? diff --git a/interpreter/ruby/ruby.go b/interpreter/ruby/ruby.go
index 2f3eeddf..36c51210 100644
--- a/interpreter/ruby/ruby.go
+++ b/interpreter/ruby/ruby.go
@@ -4,7 +4,6 @@
package ruby // import "go.opentelemetry.io/ebpf-profiler/interpreter/ruby"
import (
- "debug/elf"
"encoding/binary"
"errors"
"fmt"
@@ -1405,30 +1404,27 @@ func Loader(ebpf interpreter.EbpfHandler, info *interpreter.LoaderInfo) (interpr
// ruby_current_ec TLS symbol.
// We could potentially add a fallback for this in the future, but for now
// only unstripped ruby is supported. Many distro supplied rubies are stripped.
- if err = ef.VisitTLSRelocations(func(r pfelf.ElfReloc, symName string) bool {
- if symName == rubyCurrentEcTlsSymbol ||
- libpf.SymbolValue(r.Addend) == currentEcSymbolAddress {
- currentEcTpBaseTlsOffset = libpf.Address(r.Off)
- return false
+ //
+ // We scan relocations in a single pass, looking for both TLSDESC (preferred)
+ // and DTPMOD64 (DTV fallback when TLSDESC is unavailable).
+ var tlsModuleIdOffset libpf.Address
+ if err = ef.VisitRelocations(func(r pfelf.ElfReloc, symName string) bool {
+ switch {
+ case ef.IsTLSDesc(r):
+ if symName == rubyCurrentEcTlsSymbol ||
+ libpf.SymbolValue(r.Addend) == currentEcSymbolAddress {
+ currentEcTpBaseTlsOffset = libpf.Address(r.Off)
+ return false
+ }
+ case ef.IsDTPMOD64(r):
+ log.Debugf("Found DTPMOD64 relocation at offset %x", r.Off)
+ tlsModuleIdOffset = libpf.Address(r.Off)
}
return true
+ }, func(r pfelf.ElfReloc) bool {
+ return ef.IsTLSDesc(r) || ef.IsDTPMOD64(r)
}); err != nil {
- log.Warnf("failed to locate TLS descriptor: %v", err)
- }
-
- // Look for DTPMOD64 relocation to find the TLS module ID offset.
- // This is used for DTV-based TLS access when TLSDESC is unavailable.
- var tlsModuleIdOffset libpf.Address
- if err = ef.VisitRelocations(func(r pfelf.ElfReloc, _ string) bool {
- log.Debugf("Found DTPMOD64 relocation at offset %x", r.Off)
- tlsModuleIdOffset = libpf.Address(r.Off)
- return false
- }, func(rela pfelf.ElfReloc) bool {
- ty := rela.Info & 0xffff
- return (ef.Machine == elf.EM_AARCH64 && elf.R_AARCH64(ty) == elf.R_AARCH64_TLS_DTPMOD64) ||
- (ef.Machine == elf.EM_X86_64 && elf.R_X86_64(ty) == elf.R_X86_64_DTPMOD64)
- }); err != nil {
- log.Warnf("failed to find DTPMOD64 relocation: %v", err)
+ log.Warnf("failed to locate TLS relocations: %v", err)
}
log.Debugf("Discovered EC tls tpbase offset %x, dtpmod offset %x, fallback ctx %x, interp ranges: %v, global symbols: %x",
diff --git a/libpf/pfelf/file.go b/libpf/pfelf/file.go
index 27581c09..ed3f1ce9 100644
--- a/libpf/pfelf/file.go
+++ b/libpf/pfelf/file.go
@@ -635,16 +635,39 @@ func (f *File) DebuglinkFileName(elfFilePath string, elfOpener ELFOpener) string
type ElfReloc *elf.Rela64
+// IsTLSDesc returns true if the relocation is a TLSDESC relocation for the
+// machine type of this ELF file.
+func (f *File) IsTLSDesc(rela ElfReloc) bool {
+ ty := rela.Info & 0xffff
+ switch f.Machine {
+ case elf.EM_AARCH64:
+ return elf.R_AARCH64(ty) == elf.R_AARCH64_TLSDESC
+ case elf.EM_X86_64:
+ return elf.R_X86_64(ty) == elf.R_X86_64_TLSDESC
+ default:
+ return false
+ }
+}
+
+// IsDTPMOD64 returns true if the relocation is a DTPMOD64 relocation for the
+// machine type of this ELF file.
+func (f *File) IsDTPMOD64(rela ElfReloc) bool {
+ ty := rela.Info & 0xffff
+ switch f.Machine {
+ case elf.EM_AARCH64:
+ return elf.R_AARCH64(ty) == elf.R_AARCH64_TLS_DTPMOD64
+ case elf.EM_X86_64:
+ return elf.R_X86_64(ty) == elf.R_X86_64_DTPMOD64
+ default:
+ return false
+ }
+}
+
// VisitTLSDescriptors visits all TLS relocations and provides the relocation
// for the TLS symbol, as well as a best-effort string for the symbol's name
// it continues until the visitor returns false
func (f *File) VisitTLSRelocations(visitor func(ElfReloc, string) bool) error {
- checkFunc := func(rela ElfReloc) bool {
- ty := rela.Info & 0xffff
- return (f.Machine == elf.EM_AARCH64 && elf.R_AARCH64(ty) == elf.R_AARCH64_TLSDESC) ||
- (f.Machine == elf.EM_X86_64 && elf.R_X86_64(ty) == elf.R_X86_64_TLSDESC)
- }
- return f.VisitRelocations(visitor, checkFunc)
+ return f.VisitRelocations(visitor, f.IsTLSDesc)
}
--
2.52.0
Where this If not, could you clarify what kind of DTV helper you meant? |
Yeah, I had something like this in mind. But its also fine for me without. |
For now i think let's leave without, we can revisit refactoring this code later - for now this feels premature to me. |
fabled
left a comment
There was a problem hiding this comment.
a small comment on the filter function to make it just a const instead of function argument. Also, would it be possible to get musl based coredump test?
84eba46 to
1a9790f
Compare
Good idea, that's a lot cleaner. No need for
Done, this ended up being more of an adventure than I thought - i was taking the coredumps inside an alpine docker container, but it turns out if you try and import them on a glibc-based host, it fails in strange ways. Took me a while to realize this was because of CGO / using gcore built for the wrong libc. As an aside, we should probably update the coredump tooling to bail if the libc's don't match when calling As expected though, the PR is now failing CI - @florianl can you please import ruby-3.4.7-musl-dtv-loop-amd64.moduledata.tar ? That should cause CI to pass (I also verified the tests themselves should be able to run despite a different glibc, it is only the importing that is problematic, FWIW). FWIW I also ran validation tests where I explicitly forced the "single_ractor_main" fallback for the EC to fail by updating the symbol name to append Hopefully, when we upload the new blob and rerun, everything passes 🤞 |
fabled
left a comment
There was a problem hiding this comment.
Thanks! One nit, but approving at this point.
| checkFunc := func(rela ElfReloc) bool { | ||
| return f.classifyReloc(rela)&relTypes != 0 | ||
| } |
There was a problem hiding this comment.
nit: Perhaps the switch f.Machine could be here, and it assigns checkFunc (maybe call it filterFunc instead?) to a previously defined "plain" function that is architecture specific. We get the switch out from inner loop, and the code simplifies a little bit.
There was a problem hiding this comment.
Thanks @fabled i think i've addressed this, let me know if it isn't what you had in mind.
dece87c to
d1f7408
Compare
bbd4346 to
df19871
Compare
|
@fabled @florianl I have put this back in draft as from @nsavoire 's discoveries above, there are two issues with the code that this depends on that need to be addressed first. I have opened a PR for each one:
My profuse apologies for both oversights, which made it into main in #929 which I submitted. Also to address the issue of the coredump tests not failing on main, I want to resubmit them but first run: This will allow the coredump artifact itself to not do the single_main_ractor_fallback, rather than having to fudge the symbol in go as I had done above. I'll prepare new coredump tests and if/when the above two PRs land, rebase this branch and re-open it for review - but it should not be merged in its current state. |
a8756d6 to
b6ef3e3
Compare
|
@florianl i've added two replacement coredumps, please add these to the CI module data (and we can remove the previously added dtv ones as unused). These two should properly fail on main without any modifications to the source code, as i've stripped the fallback symbol out of libruby.so before taking the coredump. I've rebased with the other fixes mentioned above that have since merged, this PR should be green and ready to go again once the module data is uploaded. |
|
@fabled can you merge this PR? |
|
Thanks for uploading, to prove that the coredump tests fail on main, i checked them out onto a branch based on main and ran CI: as expected they both fail So this should be (finally) good to go. |
|
@dalehamel Can you resolve conflicts? I'll merge after. |
When TLSDESC relocations are unavailable (e.g. some musl setups or dynamically allocated TLS), the Ruby execution context can still be found by traversing the Dynamic Thread Vector (DTV). This commit: - Adds a shared dtv_read() helper in tsd.h (alongside existing tsd_read) that traverses the DTV using DTVInfo from PR open-telemetry#929's libc introspection. This helper is available to all interpreters, not just Ruby. - Generalizes VisitTLSRelocations into VisitRelocations with a pluggable relocation type filter, enabling lookup of DTPMOD64 relocations to find the TLS module ID offset for libruby.so. - Adds DTVInfo, current_ec_tls_offset, and tls_module_id fields to RubyProcInfo so the eBPF unwinder can use DTV traversal. - Implements UpdateLibcInfo for Ruby to receive DTVInfo when the libc package provides it (may arrive from a separate DSO like ld-linux.so). - Adds a DTV fallback path in ruby_tracer.ebpf.c between the existing TLSDESC path and the ractor fallback.
Adds a coredump test for a shared-library Ruby 3.4.7 on amd64, where libruby.so uses R_X86_64_DTPMOD64 relocations (the default x86_64 TLS dialect). This exercises the DTV-based TLS lookup path: VisitRelocations with DTPMOD64 filter, UpdateLibcInfo with DTVInfo, and dtv_read() in the eBPF unwinder. This test is amd64-only because aarch64 linkers aggressively relax GD TLS to TLSDESC, making it impossible to produce DTPMOD64 relocations on aarch64 with standard toolchains. Captured via gdb attach + breakpoint on rb_vm_exec to ensure a clean snapshot deep in the Ruby VM stack.
- Convert relocation type checks from chained || expressions to switch statements in both ruby.go Loader() and pfelf VisitTLSRelocations() for improved readability. - Include module_id in DTV read failure DEBUG_PRINT for easier debugging.
Address review feedback from fabled: move arch-specific relocation type checks into pfelf as abstract RelocType constants (RelTLSDESC, RelDTPMOD64) with a bitmask-based filter, replacing the function argument in VisitRelocations. This centralizes the arch-to-reloc mapping in classifyReloc() and simplifies callers.
Ruby 3.4.7 built from source with ruby-install on Alpine (musl libc), linked with --enable-shared. This exercises the DTV-based TLS access path for ruby_current_ec on musl, where TLSDESC relocations are not used and we fall back to DTPMOD64.
Move the architecture dispatch (switch f.Machine) from classifyReloc, which was called on every relocation entry, into VisitRelocations where it runs once to select an arch-specific classifier function. This avoids redundant machine-type checks in the inner loop. Rename checkFunc to filterFunc for clarity. Addresses review feedback from @fabled.
DTV access is always indirect: TP+offset yields a pointer to the DTV array, which must be dereferenced before indexing by module ID. This is true for both glibc and musl (see open-telemetry#1295) Thus, remove the conditional indirect check from dtv_read() in tsd.h and always perform the pointer dereference. Also remove the now-invalid reference to DTVInfo.Indirect in the Ruby interpreter debug logging. Thanks @nsavoire for uncovering this.
Regenerate both glibc and musl Ruby 3.4.7 DTV coredump tests with ruby_single_main_ractor stripped from libruby.so via objcopy, and coredump_filter set to 0x3f for full memory dumps. Both fail on main (which lacks the DTV path) and pass on this branch. This is a more robust enforcement of disabling the fallback to single_main_ractor in a way that cleanly adds the coredumps as regression test.
17f9eb3 to
80d11ce
Compare
Cheers, done. blobs rebuilt on top commit after rebase. |
What
Add DTV-based TLS access for finding
ruby_current_ecwhen TLSDESC relocations are unavailable, and a shareddtv_read()eBPF helper for traversing the Dynamic Thread Vector.Partially addresses #883, at least for ruby
Why
On x86_64 with the default TLS dialect,
libruby.sousesR_X86_64_DTPMOD64relocations instead of TLSDESC for thread-local storage. The existing Ruby profiler only supports TLSDESC-based TLS access, so it cannot findruby_current_econ these systems and falls back to the ractor path (which is slower and less reliable). This is also needed for systems where TLSDESC may not be available.How
dtv_read()helper insupport/ebpf/tsd.h(alongside existingtsd_read()) that traverses the DTV usingDTVInfofrom Extract DTV info from __tls_get_addr, add to LibcInfo #929's libc introspection. This helper is available to all interpreters, not just Ruby.VisitTLSRelocationsintoVisitRelocationswith a pluggable relocation type filter inlibpf/pfelf/file.go, enabling lookup ofDTPMOD64relocations to find the TLS module ID offset forlibruby.so.DTVInfo,current_ec_tls_offset, andtls_module_idfields toRubyProcInfoso the eBPF unwinder can use DTV traversal.UpdateLibcInfofor Ruby to receiveDTVInfowhen the libc package provides it (may arrive from a separate DSO likeld-linux.so).ruby_tracer.ebpf.cbetween the existing TLSDESC path and the ractor fallback.metricID_UnwindErrBadDTVReadfor observability of DTV read failures.NOTE: This depends on coredump artifacts which need to be added to the CI modulestore, they've been uploaded to a shared drive that @florianl has access to. Please re-run CI after uploading these artifacts.