read

When changing the implementation of a time clock, I had to switch from a static method — think about a classic call to clock_gettime() for example — to a per-thread model. Long story short, it was about a constant TSC that was not so constant across different sockets.

Per-thread and static variables: that sounds like a perfect case for Thread-Local Storage (TLS). However this time clock was widely used in my project and was taking so far a couple of CPU cycles, and I wanted to be sure that the TLS would not add a significant overhead.

A quick look on Google only revealed an article from the Intel Developer Zone: the hidden performance cost of accessing thread-local variables. Among pro-tips like “you simply have to buy a Intel©® VTune™ licence and don’t forget the Parallel Studio XE©®™ it only costs 2k$”, the awesome screenshots or the comment section where someone says that the code is slow due to the TLS implementation of Windows while another one points out the code was running under Linux, I simply decided to figure out myself.

So I did what I prefer: I profiled it.

TL;DR

If you have statically linked code, you can use a thread local variable like another one: there is only one TLS block and its entire data is known at link-time, which allows the pre-calculation of all offsets.

If you are using dynamic modules, there will be a lookup each time you access the thread local variable, but you have some options in order to avoid it in speed-critical code.

µ-benchmarking the access of TLS data

I wrote a snippet that mimics the code I had in production: one was accessing a static variable, the other one a thread local variable, through a static method of the class.

Note that the thread_local keyword was introduced in C++11, and implemented in GCC 4.8 or newer. If you are using an older version, you can use the __thread GCC keyword.

To time my experiments, I am using geiger, a benchmarking library I developed. geiger can read hardware performance counters — using the PAPI library — and is ideal to benchmark very short snippets of code.

Here, I simply declare a benchmark suite with two tests. The first one accesses static data, and the second one thread local data. The suite is run twice: the first time, each test is run once, and the second time, each test is run one hundred times.

#include "inc.hpp" #include <geiger/benchmark.h> #include <vector> int main(int argc, char **argv) { geiger::init(); geiger::suite<geiger::instr_profiler> s; Inc inc; IncTLS inc_tls; s.add("static", [&]() { inc(); }); s.add("tls", [&]() { inc_tls(); }); s.set_printer<geiger::printers::console>(); s.run(1); s.run(100); return 0; }

Let’s build it with our favorite flags:

The results are quite straightforward: no overhead at all. There is always a bit of jitter with hardware counters, so do not conclude we have more cycles with the TLS test, there is quite some variance here between several runs.

Test Time (ns) PAPI_TOT_INS PAPI_TOT_CYC PAPI_BR_MSP ---------------------------------------------------------- static 4 994 1699 12 tls 4 994 1812 10

The assembly code also confirms it. I build with CLang 3.6 but I got the same with GCC 4.7:

$ disass ./tls "Inc::operator()()" Dump of assembler code for function Inc::operator()(): 0x0000000000404f40 <+0>: inc DWORD PTR [rip+0x202382] 0x0000000000404f46 <+6>: ret End of assembler dump. $ disass ./tls "IncTLS::operator()()" Dump of assembler code for function IncTLS::operator()(): 0x0000000000404f50 <+0>: inc DWORD PTR fs:0xfffffffffffffffc 0x0000000000404f58 <+8>: ret End of assembler dump.

NB: the glibc is using the segment register FS for accessing the TLS

How about __tls_get_addr ?

Apparently, all is good, I can use TLS as I want. But hang on… how about this function that was in the callstack on the Intel blog, __tls_get_addr ?

One thing that always scares me the most when I do micro benchmarks is about the reliability of their results. Indeed, it is very easy to screw up. You have to generate some realistic inputs that will trigger the same behavior that you got in production, without too many side effects. And from the D-cache to the BTB, everything is hot…

Here, we have a sneaky case: the compilation is part of the environment of our benchmark. If the thread variable is in a dynamic module, the linker cannot compute a direct reference to it and we get the much vaunted lookup!

Let’s build it again, as a shared library this time:

cmake_minimum_required(VERSION 2.8) set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -g -O3 std=c++14") add_library(inc SHARED inc.cpp) add_executable(tls tls.cpp) target_link_libraries(tls inc geiger)

There is in this case a difference ; the method that accesses the thread local variable takes twice more time. Please not that this overhead is still very small, as in this case we compare it to an extremely cheap operation…

Test Time (ns) PAPI_TOT_INS PAPI_TOT_CYC PAPI_BR_MSP ---------------------------------------------------------- static 6 1194 2990 20 tls 12 2694 3568 15

The assembly code shows the same difference:

$ disass ./libinc.so "Inc::operator()()" Dump of assembler code for function Inc::operator()(): 0x0000000000000870 <+0>: mov rax,QWORD PTR [rip+0x200781] 0x0000000000000877 <+7>: inc DWORD PTR [rax] 0x0000000000000879 <+9>: ret End of assembler dump. $ disass ./libinc.so "IncTLS::operator()()" Dump of assembler code for function IncTLS::operator()(): 0x0000000000000880 <+0>: push rax 0x0000000000000881 <+1>: data32 lea rdi,[rip+0x20074f] 0x0000000000000889 <+9>: data32 data32 call 0x770 <__tls_get_addr@plt> 0x0000000000000891 <+17>: inc DWORD PTR [rax] 0x0000000000000893 <+19>: pop rax 0x0000000000000894 <+20>: ret End of assembler dump.

The TLS implementation in ELF

As global and static variables are stored in .data and .bss sections, thread local variables get stored in equivalent .tdata and .tbss sections.

Each thread has its on TCB — Thread Control Block — that contains some metadata about the thread itself. One metadata is the address of the DTV — Dynamic Thread Vector — which is a vector that contains pointers to thread local variables and other TLS blocks.

TLS implementation

When using statically linked code, everything is simpler as there is only one TLS block. As the offsets to the thread local variables are known at link-time, the compiler is able to generate a direct access to the thread local variable.

But when you have dynamic modules, several TLS blocks are present. It is easier to see with the following simplified version of __tls_get_addr. offset is known at build time, then module_id is the only variable, which is zero when using static linking.

void* __tls_get_offset(size_t module_id, size_t offset) { char* tls_block = dtv[thread_id][module_id]; if (tls_block == UNALLOCATED_TLS_BLOCK) tls_block = dtv[thread_id][module_id] = allocate_tls(module_id); return tls_block + offset; }

This is the actual implementation from the glibc 2.20:

void * __tls_get_addr (GET_ADDR_ARGS) { dtv_t *dtv = THREAD_DTV (); if (__glibc_unlikely (dtv[0].counter != GL(dl_tls_generation))) return update_get_addr (GET_ADDR_PARAM); void *p = dtv[GET_ADDR_MODULE].pointer.val; if (__glibc_unlikely (p == TLS_DTV_UNALLOCATED)) return tls_get_addr_tail (GET_ADDR_PARAM, dtv, NULL); return (char *) p + GET_ADDR_OFFSET; }

The ABI specific parts are implemented in sysdeps ; as for x86_64, we access the TCB from the segment register FS:

/* Return the address of the dtv for the current thread. */ # define THREAD_DTV() \ ({ struct pthread *__pd; \ THREAD_GETMEM (__pd, header.dtv); }) /* Read member of the thread descriptor directly. */ # define THREAD_GETMEM(descr, member) \ ({ __typeof (descr->member) __value; \ if (sizeof (__value) == 1) \ asm volatile ("movb %%fs:%P2,%b0" \ : "=q" (__value) \ : "0" (0), "i" (offsetof (struct pthread, member))); \ ....

So what?

Accessing thread local variables is cheap in both cases ; the lookup done by the glibc is very fast (and takes a constant time). But if your code is — or could be — loaded as a dynamic module and is speed-critical, then it can be worth spending some time to avoid this lookup.

One easy win could be to access the thread local data from the same block of code. The compiler will be smart enough to call only once __tls_get_addr. This could be done by, for example, inlining the function that access the TLS.

One could think about keeping the reference of the thread local variable, but I found this solution very dodgy. You might have to prepare your arguments to pass the code review :) …

More information about the TLS and its implementation on GNU/Linux in ELF Handling for TLS from Ulrich Drepper.

TLS performance overhead and cost on GNU/Linux

David Gross

TL;DR

µ-benchmarking the access of TLS data

How about __tls_get_addr ?

The TLS implementation in ELF

So what?

Written by

David Gross

Supported by

Thoughts from a Wall Street developer

A blog about C++, with an emphasis on low-latency, performance measurements and system programming.