#710 has a lengthy discussion about this in this comment thread #710 (comment), and this comment has lots of great resources #710 (comment), including:
But fundamentally:
- On aarch64, we so far seem to be able to reliably use TLS descriptors. The compiler will relocate the TLS references, and there is no need to directly access the DTV.
- On x86_64, the compiler uses the legacy TLS access mode by default, unless users specify -mtls-dialect=gnu2. This mode requires us to access the TLS through the DTV, as the compiler won't do the relocations for us. For this we need to add additional code that will detect:
  - That the executable is using the legacy TLS mode. This is easy enough: we won't see the TLS descriptors, e.g. if we run
    readelf -r PATH_TO_SO | grep TLS
  - How to get the DTV offset from tpbase/fsbase. I think we should just be able to disassemble __tls_get_addr from libc or ld, and the offset should be readily apparent (see below).
  - What the module ID is. We should be able to get it from R_X86_64_DTPMOD64. Basically we read this relocation's ELF address, add that to the base address in the process, and read the module ID from that location once the linker has loaded everything (no need to parse the link map).
  - This will need to be verified to work with both glibc and musl.
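The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: read_u64 stands in for whatever reads 8 bytes from the target process, all function and parameter names are hypothetical, and var_offset is assumed to be the variable's offset inside its module's TLS block. The DTV entry size (16 for glibc, 8 for musl) is inferred from the shl $0x4 and the (%rdx,%rcx,8) addressing in the disassemblies below.

```python
UNALLOCATED = (1 << 64) - 1  # glibc compares the DTV entry against -1 before using it


def tls_mode(reloc_names):
    """Classify the TLS access mode from a set of relocation type names."""
    if any(name.endswith("_TLSDESC") for name in reloc_names):
        return "descriptor"
    if "R_X86_64_DTPMOD64" in reloc_names:
        return "legacy"
    return "none"


def read_module_id(read_u64, load_base, dtpmod_r_offset):
    """Read the module ID from the slot targeted by R_X86_64_DTPMOD64."""
    return read_u64(load_base + dtpmod_r_offset)


def tls_var_address(read_u64, fs_base, dtv_offset, module_id,
                    var_offset, dtv_entry_size):
    """Follow fsbase -> DTV -> module TLS block -> variable."""
    dtv = read_u64(fs_base + dtv_offset)
    block = read_u64(dtv + module_id * dtv_entry_size)
    if block == UNALLOCATED:
        return None  # block not allocated yet for this thread
    return block + var_offset


# Fake process memory for illustration (all addresses are made up).
fake_memory = {
    0x7000 + 8: 0x9000,        # fs_base + 8 holds the DTV pointer
    0x9000 + 2 * 16: 0xA000,   # DTV entry for module 2 (glibc-style, 16-byte entries)
    0x400000 + 0x3FF0: 2,      # slot patched by R_X86_64_DTPMOD64
}
read = fake_memory.__getitem__

module_id = read_module_id(read, 0x400000, 0x3FF0)
address = tls_var_address(read, 0x7000, 8, module_id, 0x18, 16)
```

With the fake memory above, the lookup walks fs_base + 8, then DTV entry 2, then adds the variable offset to the block address.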
Getting the DTV offset from fsbase
The disassembly of __tls_get_addr shouldn't be too bad. Here it is for glibc (x86_64):
Dump of assembler code for function __tls_get_addr:
0x00007ffff7fdc820 <+0>: endbr64
0x00007ffff7fdc824 <+4>: mov %fs:0x8,%rdx
0x00007ffff7fdc82d <+13>: mov 0x21874(%rip),%rax # 0x7ffff7ffe0a8 <_rtld_global+4264>
0x00007ffff7fdc834 <+20>: cmp %rax,(%rdx)
0x00007ffff7fdc837 <+23>: jne 0x7ffff7fdc84f <__tls_get_addr+47>
0x00007ffff7fdc839 <+25>: mov (%rdi),%rax
0x00007ffff7fdc83c <+28>: shl $0x4,%rax
0x00007ffff7fdc840 <+32>: mov (%rdx,%rax,1),%rax
0x00007ffff7fdc844 <+36>: cmp $0xffffffffffffffff,%rax
0x00007ffff7fdc848 <+40>: je 0x7ffff7fdc84f <__tls_get_addr+47>
0x00007ffff7fdc84a <+42>: add 0x8(%rdi),%rax
0x00007ffff7fdc84e <+46>: ret
0x00007ffff7fdc84f <+47>: push %rbp
0x00007ffff7fdc850 <+48>: mov %rsp,%rbp
0x00007ffff7fdc853 <+51>: and $0xfffffffffffffff0,%rsp
0x00007ffff7fdc857 <+55>: call 0x7ffff7fd9e30 <__tls_get_addr_slow>
0x00007ffff7fdc85c <+60>: mov %rbp,%rsp
0x00007ffff7fdc85f <+63>: pop %rbp
0x00007ffff7fdc860 <+64>: ret
The first instruction after endbr64 has what we need: mov %fs:0x8,%rdx. The DTV pointer is at $fs_base + 8 (which matches what I've determined through source code analysis, again in this comment #710 (comment)).
And here it is for musl (x86_64):
(gdb) disassemble __tls_get_addr
Dump of assembler code for function __tls_get_addr:
0x0000000000065370 <+0>: mov %fs:0x0,%rax
0x0000000000065379 <+9>: mov (%rdi),%rcx
0x000000000006537c <+12>: mov 0x8(%rax),%rdx
0x0000000000065380 <+16>: mov 0x8(%rdi),%rax
0x0000000000065384 <+20>: add (%rdx,%rcx,8),%rax
0x0000000000065388 <+24>: ret
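It is again accessing the DTV at $fs_base + 8 (as we can interpret from the source code, it should be at an offset of 8), but we need to do a bit more work to get this value from the disassembly: musl first loads the thread pointer from %fs:0x0 into %rax, and only then reads the DTV pointer from 0x8(%rax).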
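A minimal sketch of how both patterns could be recognized in the raw bytes of __tls_get_addr. The byte encodings are an assumption, derived by hand from the two listings above and not dumped from a real binary. The function name find_dtv_offset is hypothetical, and this is a naive byte scan, not a real disassembler, so it could match inside unrelated instructions.

```python
import struct

# fs segment override, REX.W, MOV r64, r/m64
FS_MOV = bytes([0x64, 0x48, 0x8B])


def find_dtv_offset(code):
    """Return the offset of the DTV pointer from fsbase, or None."""
    for i in range(len(code) - 8):
        if code[i:i + 3] != FS_MOV:
            continue
        modrm, sib = code[i + 3], code[i + 4]
        # mod=00 with rm=100 means a SIB byte follows; SIB 0x25 means
        # "no base, no index, absolute disp32".
        if modrm & 0xC7 != 0x04 or sib != 0x25:
            continue
        disp = struct.unpack_from("<i", code, i + 5)[0]
        if disp != 0:
            # glibc style: mov %fs:disp,%reg loads the DTV pointer directly.
            return disp
        # musl style: %fs:0 is the thread pointer; look for the next
        # 64-bit load with an 8-bit displacement through that register.
        reg = (modrm >> 3) & 7
        for j in range(i + 9, len(code) - 3):
            if code[j] == 0x48 and code[j + 1] == 0x8B:
                m = code[j + 2]
                if m >> 6 == 1 and m & 7 == reg:
                    return struct.unpack_from("<b", code, j + 3)[0]
        return None
    return None


# endbr64; mov %fs:0x8,%rdx
GLIBC_PROLOGUE = bytes.fromhex("f30f1efa" "64488b142508000000")
# mov %fs:0x0,%rax; mov (%rdi),%rcx; mov 0x8(%rax),%rdx;
# mov 0x8(%rdi),%rax; add (%rdx,%rcx,8),%rax; ret
MUSL_BODY = bytes.fromhex(
    "64488b042500000000" "488b0f" "488b5008" "488b4708" "480304ca" "c3"
)

glibc_offset = find_dtv_offset(GLIBC_PROLOGUE)
musl_offset = find_dtv_offset(MUSL_BODY)
```

A production version would want a proper instruction decoder, plus the glibc and musl verification mentioned above.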
Static TLS
As it happens, ruby can be significantly faster if we use a static build (with no libruby.so, all code directly in bin/ruby), and one of the reasons is that you can read the value directly relative to $fsbase. Ruby checks the execution context a LOT, so this ends up being a significant speedup. So, I'll use ruby as an example.
In this model, we just need to look up a constant when we disassemble:
Dump of assembler code for function rb_current_execution_context:
0x000055555558d244 <+0>: push %rbp
0x000055555558d245 <+1>: mov %rsp,%rbp
0x000055555558d248 <+4>: mov %edi,%eax
0x000055555558d24a <+6>: mov %al,-0x14(%rbp)
0x000055555558d24d <+9>: mov $0xfffffffffffffff0,%rax
0x000055555558d254 <+16>: mov %fs:(%rax),%rax
0x000055555558d258 <+20>: mov %rax,-0x8(%rbp)
0x000055555558d25c <+24>: mov -0x8(%rbp),%rax
0x000055555558d260 <+28>: pop %rbp
0x000055555558d261 <+29>: ret
We also can't get away with reading the TLSDesc from the relocation table, because there isn't one. But we should be able to calculate that constant: $0xfffffffffffffff0 is -16 in two's complement, which is the negative offset from $fsbase where the value is located. We can compute it using the TLS symbol value for ruby_current_ec, and we don't need to access the DTV at all.
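A small sketch of that calculation, under loud assumptions: the executable's TLS block is assumed to end exactly at $fsbase (the usual x86_64 layout), and the PT_TLS segment's p_vaddr is assumed to be aligned to p_align. The function names and the example sizes are hypothetical, not taken from a real ruby binary.

```python
def to_signed64(value):
    """Interpret an unsigned 64-bit immediate as two's complement."""
    return value - (1 << 64) if value >= (1 << 63) else value


def static_tls_offset(sym_value, tls_memsz, tls_align):
    """Offset of a TLS symbol from fsbase in the main executable.

    Assumes the block occupies [fsbase - aligned_size, fsbase), where
    aligned_size is p_memsz rounded up to p_align.
    """
    aligned_size = (tls_memsz + tls_align - 1) & ~(tls_align - 1)
    return sym_value - aligned_size


# The immediate seen in rb_current_execution_context:
imm_offset = to_signed64(0xFFFFFFFFFFFFFFF0)

# Hypothetical PT_TLS values: symbol at st_value 0, 8 bytes of TLS
# data, 16-byte alignment.
computed = static_tls_offset(0, 8, 16)
```

If the symbol table and PT_TLS header of the real binary give the same number as the immediate in the disassembly, the offset can be derived without disassembling at all.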
cc @fabled