Low Latency Programming

30/09/2021 — David Gross — davidgross@optiver.com

About me

MSc Computer Science (France)
15+ years of programming
10+ years of C++
7 years at Optiver
C++ AutoTrading developer and Team Lead
https://github.com/david-grs

About this talk

Based on a Low Latency Workshop
A simple methodology for low latency programming
Some useful tools — mostly Linux
Actual C++ code
Questions: https://www.sli.do/ — Code: #599 376

About Optiver

Amsterdam, Sydney, Chicago, Shanghai
1986
440 people
45 nationalities

About Optiver

Market Making
Provide prices
Improve the market
Calculate the correct price
Being faster than the competition

What is low latency programming

It's not about throughput!
Ability to quickly react to an event
CPU-bound tasks, IO is outside of the scope
It's all about time

There is no silver bullet...

No compiler flags
No kernel config option
No library

...but there are golden rules

Measure
Do not optimize prematurely
Know your problem
Know your libraries
Know your hardware

How fast is fast?

CPU architecture - Intel Haswell

System tuning 101

RDTSC — Read Time-Stamp Counter


inline uint64_t rdtsc()
{
    uint64_t rax, rdx;
    asm volatile("rdtsc" : "=a"(rax), "=d"(rdx));
    return (rdx << 32) + rax;
}

System tuning 101

C-States, P-States
Dynamic frequency scaling incl TurboBoost
HyperThreading
Process isolation
Low Latency Performance Tuning for Red Hat Enterprise Linux 7

Step 1

Identify the hot path

Profilers don't work for events
Know which code is "hot"
Microbenchmark the hot code

Today's excercise


class Dictionary
{
public:
	Dictionary(const std::vector<std::string>& words);

	bool in_dictionary(std::string_view word) const;
};

Google Benchmark

Allows for easy creation of microbenchmarks
https://github.com/google/benchmark

Google Benchmark


void InDictionary(benchmark::State& state)
{
	int idx = 0;
	Dictionary dict(words);

	for (auto _: state)
	{
		std::string_view word = words[idx];
		idx = (idx+1) % words.size();
		benchmark::DoNotOptimize(dict.in_dictionary(word));
	}
}

Google Benchmark


void NotInDictionary(benchmark::State& state)
{
	int idx = 0;
	Dictionary dict(words);

	for (auto _: state)
	{
		std::string_view word = unknown_words[idx];
		idx = (idx+1) % unknown_words.size();
		benchmark::DoNotOptimize(dict.in_dictionary(word));
	}
}

Google Benchmark


BENCHMARK(InDictionary);
BENCHMARK(NotInDictionary);
BENCHMARK_MAIN();

Google Benchmark

Run on (4 X 3504 MHz CPU s)
2018-01-17 14:19:55
-----------------------------------------------------------------------
Benchmark                                Time           CPU Iterations
-----------------------------------------------------------------------
InDictionary                           725 ns        725 ns     884585
NotInDictionary                        812 ns        811 ns     838950

Naive implementation


class Dictionary
{
public:
	Dictionary(const std::vector<std::string>& words)
		: _container(words.begin(), words.end())
	{}


	bool in_dictionary(std::string_view word) const
	{
		return _container.find({word.data(), word.size()}) != _container.end();
	}

private:
	std::set<std::string> _container;
};

Step 2

Algorithmic complexity

Valgrind & Callgrind

Instrumentation framework
Unix only
Callgrind - profiler
KCachegrind - GUI

Valgrind & Callgrind

$ valgrind --tool=callgrind ./benchmark 
$ kcachegrind

KCachegrind

$ valgrind --tool=callgrind ./benchmark 
$ kcachegrind

Naive implementation


class Dictionary
{
public:
	Dictionary(const std::vector<std::string>& words)
		: _container(words.begin(), words.end())
	{}


	bool in_dictionary(std::string_view word) const
	{
		return _container.find({word.data(), word.size()}) != _container.end();
	}

private:
	std::set<std::string> _container;
};

Improved implementation


class Dictionary
{
public:
	Dictionary(const std::vector<std::string>& words)
		: _container(words.begin(), words.end())
	{}


	bool in_dictionary(std::string_view word) const
	{
		return _container.find({word.data(), word.size()}) != _container.end();
	}

private:
	std::unordered_set<std::string> _container; // <-- O(1)
};

Benchmark: Improving algorithmic complexity

Step 3

Allocations

KCachegrind output

Improved implementation


class Dictionary
{
public:
	Dictionary(const std::vector<std::string>& words)
		: _container(words.begin(), words.end())
	{}


	bool in_dictionary(std::string_view word) const
	{
		// This line allocates, creating std::string from std::string_view
		return _container.find({word.data(), word.size()}) != _container.end();
	}

private:
	std::unordered_set<std::string> _container; // Stores std::string
};

Non-allocating implementation


class Dictionary
{
public:
	Dictionary(const std::vector<std::string>& words)
		: _data(words.begin(), words.end())
		, _index(_data.begin(), _data.end())
	{}


	bool in_dictionary(std::string_view word) const
	{
		return _index.find(word) != _index.end(); // No allocation!
	}

private:
	std::vector<std::string> _data; // Stores std::string
	std::unordered_set<std::string_view> _index; // Stores views
};

KCachegrind output

Benchmark: Removing allocations

Step 4

Cache coherence

std::unordered_set

Open addressing

Papi++

Provides access to Hardware Performance Counters from within the program
https://github.com/david-grs/papipp

Papi - example use


	papi::event_set<PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_BR_MSP, PAPI_L1_DCM> events;
	events.start_counters();

	// Micro-benchmark loop is here

	events.stop_counters();
	std::cout
		<< events.get<PAPI_L1_DCM>().counter() / state.iterations()
		<< " L1 cache misses";

Papi - output

98 L1 cache misses
16 L1 cache misses
7.89 L1 cache misses
6.617 L1 cache misses
6.3805 L1 cache misses
6.03766 L1 cache misses
6.03299 L1 cache misses
6.03683 L1 cache misses

Open addressing implementation


	struct Entry
	{
		std::size_t hash;
		const std::string* string = nullptr;
	};

	std::vector<Entry> hashTable_;

Open addressing implementation


	bool in_dictionary(std::string_view word) const
	{
		auto hash = std::hash<std::string_view>()(word);
		auto idx = hash % hashTable_.size();

		while (true)
		{
			const Entry& entry = hashTable_[idx];

			// detect bucket
			if (!entry.string)
				return false;

			// detect match
			if(entry.hash == hash && *entry.string == word)
				return true;

			// probe
			idx = (idx + 1) % hashTable_.size();
		}
	}

PAPI++: Cache misses

Benchmark: Open addressing

Step 5

CPU instructions

Perf

"The official" Linux profiler
Has access to the same counters as PAPI
Instruments running binary
Produces nice reports
And so much more!

Perf report

  0,12 │       mov    %edx,0x8(%rsp)
       │       mov    $0xc70f6907,%edx
  0,16 │     → callq  std::_Hash_bytes(void const*, unsigned long, unsigned long)@plt
       │                     auto idx = hash % hashTable_.size();
       │       xor    %edx,%edx
  0,12 │       mov    %rax,%r14
 18,01 │       div    %rbx
       │           const_reference operator[](size_type __n) const _GLIBCXX_NOEXCEPT
       │           { return *(this->_M_impl._M_start + __n); }
  0,08 │       mov    %rdx,%rax
       │       mov    %rdx,%r12
  0,47 │       shl    $0x4,%rax
  0,27 │       add    %r15,%rax
       │                             if (!entry.string)
 35,52 │       mov    0x8(%rax),%rdx
  0,12 │       mov    %rax,%rcx
       │       test   %rdx,%rdx
  0,47 │     ↓ jne    1a4
       │     ↓ jmpq   230
       │       nop
       │                             idx = (idx+1) % hashTable_.size();
  3,06 │180:   lea    0x1(%r12),%rax
       │       xor    %edx,%edx
  5,89 │       div    %rbx
       │       mov    %rdx,%rcx
       │       mov    %rdx,%r12

Finding offending code


	bool in_dictionary(std::string_view word) const
	{
		auto hash = std::hash<std::string_view>()(word);
		auto idx = hash % hashTable_.size(); // <-- Flagged by perf

		while (true)
		{
			const Entry& entry = hashTable_[idx];

			// detect bucket
			if (!entry.string)
				return false;

			// detect match
			if(entry.hash == hash && *entry.string == word)
				return true;

			// probe
			idx = (idx + 1) % hashTable_.size(); // <-- Flagged by perf
		}
	}

Fixing offending code


	bool in_dictionary(std::string_view word) const
	{
		auto hash = std::hash<std::string_view>()(word);
		auto idx = hash & (hashTable_.size() - 1); // Fixed?

		while (true)
		{
			const Entry& entry = hashTable_[idx];

			// detect bucket
			if (!entry.string)
				return false;

			// detect match
			if(entry.hash == hash && *entry.string == word)
				return true;

			// probe
			idx = (idx + 1) & (hashTable_.size() - 1); // Fixed?
		}
	}

PAPI++: Instructions per cycle

Benchmark: Removing DIV

Summary

Always measure
Keep an eye on algorithmic complexity
Avoid allocations / synchronization
Use cache efficiently
Look into generated assembly

Links, resources, tools

https://david-grs.github.io/low_latency_presentation/
“Counting Nanoseconds Microbenchmarking C++ Code", David Gross, Meeting C++ 2019
“When a Microsecond Is an Eternity", Carl Cook, CppCon 2017
Papi++ : https://github.com/david-grs/papipp
Google Benchmark
valgrind --tool=callgrind, KCachegrind
perf
http://www.optiver.com

Low Latency Programming

About me

About this talk

About Optiver

About Optiver

What is low latency programming

There is no silver bullet...

...but there are golden rules

How fast is fast?

CPU architecture - Intel Haswell

CPU architecture - Intel Haswell

System tuning 101

RDTSC — Read Time-Stamp Counter

System tuning 101

System tuning 101

System tuning 101

Step 1

Identify the hot path

Identify the hot path

Today's excercise

Google Benchmark

Google Benchmark

Google Benchmark

Google Benchmark

Google Benchmark

Naive implementation

Step 2

Algorithmic complexity

Valgrind & Callgrind

Valgrind & Callgrind

KCachegrind

Naive implementation

Improved implementation

Benchmark: Improving algorithmic complexity

Step 3

Allocations

KCachegrind output

KCachegrind output

Improved implementation

Non-allocating implementation

KCachegrind output

Benchmark: Removing allocations

Step 4

Cache coherence

std::unordered_set

std::unordered_set

Open addressing

Papi++

Papi - example use

Papi - output

Open addressing implementation

Open addressing implementation

PAPI++: Cache misses

Benchmark: Open addressing

Step 5

CPU instructions

Perf

Perf report

Finding offending code

Fixing offending code

PAPI++: Instructions per cycle

Benchmark: Removing DIV

Summary

Links, resources, tools

Questions?