static_any: a low-latency stack-based Boost.Any


When writing some speed-critical code, I wanted to use Boost.Any as an attribute in a struct. This struct was small (2 cache lines), and I knew that all the types I would store in this Any would fit in 16 bytes.

I thought “trivial, let’s use Boost.Any with a stack-based allocator”!
… and then I got disappointed.

Disappointed, because Boost.Any does not offer the possibility to use a custom allocator. To be honest, that would not have solved the problem anyway, as Boost.Any does other things that slow this container down, but that was enough to start thinking about another, home-made solution.

Talk is cheap. Show me the code.
static_any project on my GitHub

Boost.Any implementation

So why was Boost.Any too slow in my case? And why was I looking for a stack-based allocator?

Boost.Any is very simple: it erases the type via inheritance, and the few required operations, which are querying the type and cloning the objects, are done via virtual methods. Here is a skeleton:

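    struct any
    {
        // Type-erased base: virtual methods to query the type and clone the object
        struct placeholder { ... };

        // Concrete holder for a value of type _T
        template <typename _T>
        struct holder : public placeholder { _T held; ... };

        // The value is always copied into a heap-allocated holder
        template <typename _T>
        any(const _T& t) : m_placeholder(new holder<_T>(t)) {}

    private:
        placeholder* m_placeholder;
    };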
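To make the skeleton concrete, here is a minimal runnable sketch of the same inheritance-based type erasure. It is a simplified stand-in, not the Boost source: the class name `tiny_any`, the member `held` and the `cast` helper are hypothetical names chosen for this illustration.

```cpp
#include <typeinfo>
#include <utility>

// Hypothetical, simplified stand-in for Boost.Any (not the Boost source).
class tiny_any {
    // Type-erased interface: the only operations the container needs.
    struct placeholder {
        virtual ~placeholder() = default;
        virtual const std::type_info& type() const = 0;  // query the stored type
        virtual placeholder* clone() const = 0;          // deep-copy the stored object
    };

    // Concrete holder: knows the real type T of the stored value.
    template <typename T>
    struct holder : placeholder {
        explicit holder(const T& value) : held(value) {}
        const std::type_info& type() const override { return typeid(T); }
        placeholder* clone() const override { return new holder(held); }
        T held;
    };

    placeholder* m_placeholder = nullptr;  // always points to the heap

public:
    tiny_any() = default;

    // Every stored value costs one heap allocation.
    template <typename T>
    tiny_any(const T& value) : m_placeholder(new holder<T>(value)) {}

    // Copying goes through the virtual clone().
    tiny_any(const tiny_any& other)
        : m_placeholder(other.m_placeholder ? other.m_placeholder->clone() : nullptr) {}

    // Copy-and-swap assignment.
    tiny_any& operator=(tiny_any other) {
        std::swap(m_placeholder, other.m_placeholder);
        return *this;
    }

    ~tiny_any() { delete m_placeholder; }

    const std::type_info& type() const {
        return m_placeholder ? m_placeholder->type() : typeid(void);
    }

    // Returns a pointer to the stored value, or nullptr on a type mismatch.
    template <typename T>
    T* cast() {
        return type() == typeid(T)
            ? &static_cast<holder<T>*>(m_placeholder)->held
            : nullptr;
    }
};
```

With this sketch, `tiny_any a = 1234;` allocates a `holder<int>` on the heap, and `a.cast<int>()` goes through one virtual call to check the type before handing back a pointer into that heap block.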

My main issue with this implementation was not at all the cost of calling virtual methods, but the memory layout that it would impose on my structure.

I knew that my stored types were small, and I wanted them on the same cache line.

To be more concrete, let’s take the following code:

The variables i and l are next to each other in memory, but the integer 1234 will be somewhere else — due to the heap-based implementation of Boost.Any — and maybe even on another page.

And of course, you get all the benefits of having these variables located in the same contiguous chunk of memory: better data locality means fewer cache misses and better prefetching.
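The layout difference can be checked with a small sketch. Everything here is an assumption made for illustration: `heap_record` and `inline_record` are hypothetical structs, `std::unique_ptr<int>` stands in for the heap-allocated holder of Boost.Any, and a 16-byte buffer stands in for an in-place any.

```cpp
#include <cstdint>
#include <memory>
#include <new>

// Heap-based layout: the struct only holds a pointer, as with Boost.Any.
// std::unique_ptr<int> is a stand-in for the heap-allocated holder.
struct heap_record {
    int i;
    std::unique_ptr<int> value;
    long l;
};

// Inline layout: the value lives in a 16-byte buffer inside the struct,
// right between i and l.
struct inline_record {
    int i;
    alignas(8) unsigned char value[16];
    long l;
};

// True if the address p falls within the bytes occupied by outer.
template <typename Outer>
bool lies_inside(const Outer& outer, const void* p) {
    const auto begin = reinterpret_cast<std::uintptr_t>(&outer);
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return addr >= begin && addr < begin + sizeof(Outer);
}
```

For a `heap_record` built with `std::make_unique<int>(1234)`, `lies_inside` reports that the integer is outside the struct, so reading it may touch another cache line or page. For an `inline_record` where the integer is placed into `value` with placement new, the same check reports that it sits inside the struct, next to `i` and `l`.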

The benchmark

As always when talking about speed, we want to see some numbers. So here you are — it is in nanoseconds, the lower the better:

Time spent on assignment and get operations in nanoseconds

static_any is faster on assignment mainly because it does not perform a memory allocation. On the get operation, the difference comes from the virtual calls and other implementation details of Boost.Any.

But the main reason I wanted such a generic container, i.e. a stack-based any, is not shown by these numbers. When you benchmark a piece of code, you run it in a loop, and you do not get any cache misses. These numbers only measure raw speed, and most of the time, instructions per cycle for such operations won’t be your bottleneck, but memory can be.

static_any

The usage is as simple as Boost.Any:

    static_any<32> a = 1234;
    int x = a.get<int>();            // returns 1234
    bool bi = a.has<int>();          // returns true
    bool bd = a.has<double>();       // returns false
    double d = a.get<double>();      // throws!
    a = std::string("hello world");  // moved to a

At the beginning, I only implemented it for trivially copyable types, as I was only using this container with such objects. This super simple and even faster container — but unsafe, as there is no type checking at runtime — is in the same header any.hpp under the name of static_any_t.
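A container restricted to trivially copyable types can be sketched in a few lines. This is not the static_any_t from any.hpp, only an illustration under assumptions: the name `trivial_any` and its interface are hypothetical, and the value is simply copied in and out of a fixed buffer with `std::memcpy`, with no record of which type was stored.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical sketch of an unchecked, trivially-copyable-only container.
template <std::size_t N>
class trivial_any {
    alignas(std::max_align_t) unsigned char m_buffer[N] = {};

    // Byte-wise copy into the buffer; only valid for trivially copyable T.
    template <typename T>
    void store(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "only trivially copyable types can be stored");
        static_assert(sizeof(T) <= N, "stored type does not fit in the buffer");
        std::memcpy(m_buffer, &value, sizeof(T));
    }

public:
    trivial_any() = default;

    template <typename T>
    trivial_any(const T& value) { store(value); }

    template <typename T>
    trivial_any& operator=(const T& value) {
        store(value);
        return *this;
    }

    // No runtime type check: asking for the wrong T silently reinterprets bytes.
    // T must also be default constructible in this sketch.
    template <typename T>
    T get() const {
        static_assert(std::is_trivially_copyable<T>::value,
                      "only trivially copyable types can be read");
        static_assert(sizeof(T) <= N, "requested type does not fit in the buffer");
        T value;
        std::memcpy(&value, m_buffer, sizeof(T));
        return value;
    }
};
```

There is no destructor call, no virtual dispatch and no allocation, which is why this kind of container can be very fast. The price is that `get<double>()` after storing an `int` compiles and runs, but returns meaningless data.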

Later, after a few discussions with my workmate Maciek, he came up with the awesome idea of the gateway function, which allows static_any to go from the erased type (the vector of bytes used as underlying storage in static_any) to the real type T that is stored.

I will describe that in a later post. In the meantime, you can grab the code…

David Gross

Thoughts from a Wall Street developer

A blog about C++, with an emphasis on low-latency, performance measurements and system programming.