---
title: C++ Multithreading
---

Computers have to handle many different things at once, for example
screen, keyboard, drives, database, internet.

These are best represented as communicating concurrent processes, with
channels, as in goroutines. Even algorithms that are not really handling
many things at once, but are doing a single thing, such as everyone’s
sample program, the sieve of Eratosthenes, are cleanly represented as
communicating concurrent processes with channels.

[asynch await]:../client_server.html#the-equivalent-of-raii-in-event-oriented-code

On the other hand, they are also represented, though not quite so cleanly, by [asynch await], which makes for much lighter weight code that interfaces more cleanly with C++.

Concurrency is not the same thing as parallelism.

A node.js program is typically thousands of communicating concurrent
processes, with absolutely no parallelism, in the sense that node.js is single
threaded, but a node.js program typically has an enormous number of code
continuations, each of which is in effect the state of a concurrent
communicating process. Lightweight threads as in Go are threads that on
hitting a pause get their stack state stashed into an event handler and
executed by event oriented code, so one can always accomplish the same
effect more efficiently by writing directly in event oriented code.

And it is frequently the case that when you cleverly implement many
concurrent processes with more than one thread of execution, so that some
of your many concurrent processes are executed in parallel, your program
runs slower, rather than faster.

C++ multithreading is written around a way of coding that in practice does
not seem all that useful: parallel bitbashing. The idea is that you are
doing one thing, but dividing that one thing up between several threads to get
more bits bashed per second, the archetypical example being a for loop
performed in parallel, and then all the threads join after the loop is
complete.

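A minimal sketch of that archetypical case, assuming the loop is a plain sum over a vector. The name `parallel_sum` and its parameters are made up for illustration. Each thread takes one contiguous slice and writes only to its own slot, so nothing needs a lock, and the joins are the only synchronization.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` by splitting it into `n_threads` slices.
long long parallel_sum(const std::vector<int>& data, std::size_t n_threads) {
    if (n_threads == 0) n_threads = 1;
    std::vector<long long> partial(n_threads, 0);
    std::vector<std::thread> workers;
    // Slice length, rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (std::size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            // Each thread writes only its own slot of `partial`.
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }
    // All the threads join after the loop is complete.
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```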
The normal case however is that you want to manage a thousand things at
once, for example a thousand connections to the server. You are not
worried about how many millions of floating point operations per second,
but you are worried about processes sitting around doing nothing while
waiting for network or disk operations to complete.

For this, you need concurrent communicating processes, as in Go, or event
orientation, as in node.js or nginx, not necessarily parallelism,
which is what C++ threads are designed around.

The need to deal with many peers and a potentially enormous number of
clients suggests multiprocessing in the style of Go and node.js, rather than
what C++ multiprocessing is designed around. It suggests a very large
number of processes that are concurrent, but not all that parallel, rather
than a small number of processes that are concurrent and also substantially
parallel. Representing a process by a thread runs into troubles at around
sixty four threads.

It is probably efficient to represent interactions between peers as threads,
but client/peer interactions are going to need either events or Go lightweight
threads, and client/client interactions are going to need events.

Existing operating systems run far more than sixty four threads, but this
only works because they are grouped into processes, and most of those processes
are inactive. If you have more than sixty four concurrently active threads in an
active process, with the intent that half a dozen or so of those active
concurrent threads will be actually executing in parallel, as for example a
browser with a thread for each tab, and sixty four tabs, that active process
is likely to be not very active.

Thus scaling Apache, whether as threads on Windows or processes under
Linux, is apt to die.

# Need the solutions implemented by Tokio, Actix, Node.js and Go

Not the solutions supplied by the C++ libraries, because we are worrying
about servers, not massive bit bashing.

Goroutines and channels can cleanly express both the kind of problem
that node.js addresses and the kind of problem that C++
threads address, typically that you divide a task into a dozen subtasks, and
then wait for them all to complete before you take the next step, which is
hard to express as node.js continuations. Goroutines are a more flexible
and general solution, that make it easier to express a wider range of
algorithms concisely and transparently, but I am not seeing any mass rush
from node.js to Go. Most of the time, it is easy enough to write in code
continuations inside an event handler.

The general concurrent task that Google’s massively distributed database
is intended to express is that you have a thousand tasks each of which
generates a thousand outputs, which get sorted, and each of the enormous
number of items that sort into the same equivalence group gets aggregated
in a commutative operation, which can therefore be handled by any
number of processes in any order, and possibly the entire sort sequence
gets aggregated in an associative operation, which can therefore be
handled by any number of processes in any order.

The magic in the Google massively parallel database is that one can define
a massively parallel operation on a large number of items in a database
simultaneously, much as one defines a join in SQL, and one can define
another massively parallel operation as commutative and/or associative
operations on the sorted output of such a massively parallel operation. But
we are not much interested in this capability. Though something
resembling that is going to be needed when we have to shard.

# doing node.js in C++

Dumb idea. We already have the node.js solution in a Rust library.

Actix and Tokio are the (somewhat Cish) solutions.

## Use Go

Throw up hands in despair, and provide an interface linking Go to secure
Zooko ids, similar to the existing interface linking it to Quic and SSL.

This solution has the substantial advantage that it would then be relatively
easy to drop in the existing social networking software written in Go, such
as Gitea.

We probably don’t want Go to start managing C++ spawned threads, but
the Go documentation seems to claim that when a Go heavyweight thread
gets stuck at a C mutex while executing C code, Go just spawns another to
deal with the lightweight threads when the lightweight threads start piling
up.

When a C++ thread wants to despatch an event to Go, it calls a Go routine
with a select and a default, so that the Go routine will never attempt to
pause the C++ spawned thread on the assumption that it is a Go spawned
thread. But it would likely be safer to call Goroutines on a thread that was
originally spawned by Go.

## doing it in C the C way

Processes represented as threads. Channels have a mutex. A thread grabs
total exclusive ownership of a channel whenever it takes something out or
puts something in. If a channel is empty or full, it then waits on a
condition on the mutex, and when the other thread grabs the mutex and
makes the channel ready, it notices that the other process or processes are
waiting on the condition, that the condition is now fulfilled, and sends a
`notify_one`.

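A minimal sketch of such a channel, assuming a fixed capacity greater than zero and one condition variable for each direction. The names `Channel`, `send`, `receive` and `channel_demo` are mine, not from any library. Here the thread that changes the channel always sends the `notify_one`, rather than first checking whether anyone is waiting, which is simpler and costs little.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

template <typename T>
class Channel {
    std::mutex m_mutex;                    // exclusive ownership of the channel
    std::condition_variable m_not_empty;   // receivers wait here
    std::condition_variable m_not_full;    // senders wait here
    std::deque<T> m_items;
    const std::size_t m_capacity;          // assumed to be greater than zero
public:
    explicit Channel(std::size_t capacity) : m_capacity(capacity) {}

    void send(T value) {
        std::unique_lock<std::mutex> lock(m_mutex);
        // wait releases the mutex while asleep and reacquires it on wake.
        m_not_full.wait(lock, [this] { return m_items.size() < m_capacity; });
        m_items.push_back(std::move(value));
        m_not_empty.notify_one();
    }

    T receive() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_not_empty.wait(lock, [this] { return !m_items.empty(); });
        T value = std::move(m_items.front());
        m_items.pop_front();
        m_not_full.notify_one();
        return value;
    }
};

// Hypothetical demo: one producer thread, the calling thread consumes.
int channel_demo() {
    Channel<int> ch(2);
    std::thread producer([&ch] {
        for (int i = 1; i <= 10; ++i) ch.send(i);
    });
    int sum = 0;
    for (int i = 0; i < 10; ++i) sum += ch.receive();
    producer.join();
    return sum;
}
```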
Or, when the channel is neither empty nor full, we have an atomic spin lock,
and when sleeping might become necessary, then we go to full mutex resolution.

Which implies a whole pile of data global to all threads, which will have
to be atomically changed.

This can be done by giving each thread two buffers for this global data
subject to atomic operations, and a single pointer or index that points to the
currently ruling global data set. (The mutex is also of course global, but
the flag saying whether to use atomics or mutex is located in a data
structure managed by atomics.)

When a thread wants to atomically update a large object (which should be
sixty four byte aligned) it constructs a copy of the current object, and
atomically updates the pointer to the copy, if the pointer was not changed
while it was constructing. The object is immutable while being pointed at.

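A minimal sketch of that copy-then-swap update, assuming C++17 (for over-aligned `new`). `Totals`, `Published` and `update_demo` are hypothetical names. The hard part this sketch dodges is deciding when a replaced object can be freed: here replaced objects are simply parked on a mutex guarded list until the destructor runs, which is an assumption standing in for a real reclamation scheme.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

struct alignas(64) Totals {   // hypothetical "large object"
    long count = 0;
    long sum = 0;
};

class Published {
    std::atomic<const Totals*> m_current;
    std::mutex m_retired_mutex;            // guards only the retire list
    std::vector<const Totals*> m_retired;  // replaced objects, freed at the end
public:
    Published() : m_current(new Totals{}) {}
    Published(const Published&) = delete;
    Published& operator=(const Published&) = delete;
    ~Published() {
        delete m_current.load();
        for (const Totals* p : m_retired) delete p;
    }

    // The object is immutable while being pointed at, so readers need no lock.
    const Totals& read() const {
        return *m_current.load(std::memory_order_acquire);
    }

    template <typename F>
    void update(F amend) {
        const Totals* seen = m_current.load(std::memory_order_acquire);
        for (;;) {
            Totals* fresh = new Totals(*seen);   // construct a copy
            amend(*fresh);                       // amend the private copy
            // Publish only if the pointer did not change while we worked.
            if (m_current.compare_exchange_weak(seen, fresh,
                    std::memory_order_acq_rel, std::memory_order_acquire)) {
                std::lock_guard<std::mutex> guard(m_retired_mutex);
                m_retired.push_back(seen);
                return;
            }
            delete fresh;   // `seen` now holds the newer pointer; try again
        }
    }
};

// Hypothetical demo: four threads each apply a thousand updates.
long update_demo() {
    Published p;
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&p] {
            for (int i = 0; i < 1000; ++i)
                p.update([](Totals& x) { ++x.count; x.sum += 2; });
        });
    for (auto& t : threads) t.join();
    return p.read().count;
}
```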
Or we could have two such objects, with the thread spinning if one is in
use and the other already grabbed, or momentarily sleeping if an atomic
count indicates other threads are spinning on a switch awaiting
completion.

The read thread, having read, stores its read pointer atomically with
`memory_order_release`, ored with the flag saying if it is going to full
mutex resolution. It then reads the write pointer with
`memory_order_acquire`, that the write thread atomically wrote with
`memory_order_release`, and if all is well, keeps on reading, and if it is
blocked, or the write thread has gone to mutex resolution, sets its mutex
resolution flag and proceeds to mutex resolution. When it is coming out of
mutex resolution, about to release the mutex, it clears its mutex resolution
flag. The mutex is near the flags by memory location, all part of one object
that contains a mutex and atomic variables.

So the mutex flag is atomically set when the mutex has not yet been
acquired, but the thread is unconditionally going to acquire it, and non
atomically cleared when the mutex still belongs to the thread, but the thread is
unconditionally going to release it.

If many read threads are reading from one channel, then each thread has to
`memory_order_acquire` the read pointer, and then, instead of
`memory_order_release`ing it, has to do an
`atomic_compare_exchange_weak_explicit`, and, if it changed while it was
reading, abort its reads and start over.

Similarly if many write threads are writing to one channel, each write thread
will first have to spin lock acquire the privilege of being the sole write thread
writing, or spin lock acquire a range to write to. Thus in the most general
case, we have a spin locked atomic write state that specifies an area that
has been written to, an area that is being written to, and an area that is
available to be acquired for writing, a spin locked atomic read state, and
a mutex that holds both the write state and the read state. In the case of a
vector buffer with multiple writers, the atomic states are three wrapping
atomic pointers that go through the buffer in the same direction.

We would like to use direct memory addresses, rather than vector or deque
addresses, which might require us to write our own vector or deque. See
the [thread safe deque](https://codereview.stackexchange.com/questions/238347/a-simple-thread-safe-deque-in-c "A simple thread-safe Deque in C++"), which however relies entirely on locks and mutexes,
and whose extension to atomic locks is not obvious.

Suppose you are doing atomic operations, but some operations might be
expensive and lengthy. You really only want to spin lock on amending data
that is small and all close together in memory, so on your second spin,
the lock has likely been released.

Well, if you might need to sleep a thread, you need a regular mutex, but
how are you going to interface spin locks and regular mutexes?

You could cleverly do it with notifies, but I suspect it is costly compared
to just using a plain old vanilla mutex. Instead you have some data
protected by atomic locks, and some data protected by regular old
mutexes, and any time the data protected by the regular old mutex might
change, you atomically flag a change coming up, and every thread then
grabs the mutex in order to amend or even look at the data, until on
coming out of the mutex with the data, they see the flag saying the mutex
protected data might change is now clear.

After one has flagged the change coming up, and grabbed the mutex, what
happens if another thread is cheerfully amending the data in a fast
operation, having started before you grabbed the mutex? The other thread
has to be able to back out of that, and then try again, this try likely to be
with mutex resolution. But what if the other thread wants to write into a
great big vector, and reallocations of the vector are mutex protected? And
we want atomic operations so that not everyone has to grab the mutex every
time.

Well, any time you want to do something to the vector, it fits or it does not.
And if it does not fit, then mutex time. You want all threads to switch
to mutex resolution, before any thread actually goes to work reallocating
the vector. So you are going to have to use the costly notify pattern. “I am
out of space, so going to sleep until I can use the mutex to amend the
vector. Wake me up when the last thread using atomics has stopped using
atomics that directly reference memory, and has switched to reading the
mutex protected data, so that I can change the mutex protected data.”

The std::vector documentation says that vector access is just as efficient as
array access, but I am a little puzzled by this claim, as a vector can be
moved, and specifically requests that you have a no throw move operation for
optimization, and having no copy operation is standard where it contains things that
might have ownership. (Which leads to complications when one has containers
of containers, since C++ is apt to helpfully generate a broken copy
implementation.)

Which would suggest that vector access is through indirection, and
indirects with threading create problems.

## lightweight threads in C

A lightweight thread is just a thread where, whenever a lightweight thread
needs to be paused by its heavyweight thread, the heavyweight thread
stores the current stack state in the heap, and moves on to deal with other
lightweight threads that need to be taken care of. This collection of
preserved lightweight thread stack states amounts to a pile of event
handlers that are awaiting events, and having received events, are then
waiting for a heavyweight thread to process that event handler.

Thus one winds up with what I suspect is the Tokio solution, a stack that
is a tree, rather than a stack.

Hence the equivalence between node.js and nginx event oriented
programming, and Go concurrent programming.

# costs

Windows 10 is, in effect, limited to sixty four threads total. If you attempt to create
more threads than that, it still works, but performance is apt to bite, with
arbitrary and artificial thread blocking. Hence goroutines, that implement
unofficial threads inside the official threads.

Thread creation and destruction is fast, five to twenty microseconds, so
thread pools do not buy you much, except that your memory is already
going to be cached. Another source says 40 microseconds on Windows,
and fifty kilobytes per thread. So, a gigabyte of RAM could have twenty
thousand threads hanging around. Except that the Windows thread
scheduler dies on its ass.

There is a reasonable discussion of thread costs [here](https://news.ycombinator.com/item?id=22456642).

General message is that lots of languages have done it better, often
immensely better, Go among them.

Checking the C++ threading libraries, they all single mindedly focus on
the particular goal of parallelizing computationally intensive work. Which
is not in fact terribly useful for anything you are interested in doing.

# Atomics

```C++
typedef enum memory_order {
    memory_order_relaxed,   // relaxed
    memory_order_consume,   // consume
    /* No one, least of all compiler writers, understands what
    "consume" does.
    It has consequences which are difficult to understand or predict,
    and which are apt to be inconsistent between architectures,
    libraries, and compilers. */
    memory_order_acquire,   // acquire
    memory_order_release,   // release
    memory_order_acq_rel,   // acquire/release
    memory_order_seq_cst    // sequentially consistent
    /* "sequentially consistent" interacts with the more commonly
    used acquire and release in ways difficult to understand or
    predict, and in ways that compiler and library writers
    disagree on. */
} memory_order;
```

I don’t think I understand how to use atomics correctly.

`atomic_compare_exchange_weak_explicit` inside a while loop is
a spin lock, and spin locks are complicated, apt to be inefficient,
potentially catastrophic, and avoiding catastrophe is subtle and complex.

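For concreteness, a minimal sketch of such a spin lock, assuming a single boolean flag and a `yield` between attempts. `SpinLock` and `spin_demo` are hypothetical names. It shows the mechanism only and does nothing about the hazards mentioned above, such as a lock holder being descheduled while others spin.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class SpinLock {
    std::atomic<bool> m_locked{false};
public:
    void lock() {
        bool expected = false;
        // Succeeds only if the flag was false, and sets it to true.
        while (!std::atomic_compare_exchange_weak_explicit(
                   &m_locked, &expected, true,
                   std::memory_order_acquire, std::memory_order_relaxed)) {
            expected = false;            // a failed exchange overwrote it
            std::this_thread::yield();   // let the holder make progress
        }
    }
    void unlock() {
        m_locked.store(false, std::memory_order_release);
    }
};

// Hypothetical demo: four threads bump a plain counter under the lock.
long spin_demo() {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&lock, &counter] {
            for (int i = 0; i < 10000; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& t : threads) t.join();
    return counter;
}
```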
To cleanly express a concurrent algorithm you need a thousand
communicating processes, as goroutines or node.js continuations, nearly
all of which are sitting around waiting for another thing to send them
a message or be ready to receive their message, while atomics give you a
fixed small number of threads all barreling full speed ahead. Whereupon
you find yourself using spin locks.

Rather than moving data between threads, you need to move threads between
data, between one continuation and the next.

Well, if you have a process that interacts with Sqlite, each thread has to
have its own database connection, in which case it needs to be a pool of
threads. Maybe you have a pool of database threads that do work received
from a bunch of asynch tasks through a single fixed size fifo queue, and
send the results back through another fifo queue, with threads waking up
when the queue gets more stuff in it, and going to sleep when the queue
empties, with the last thread signalling “wake me up when there is
something to do”, and pushback happening when the buffer is full.

Go demonstrates that you can cleanly express algorithms as concurrent
communicating processes using fixed size channels. An unbuffered
channel is just a coprocess, with a single thread of execution switching
between the two coprocesses, without any need for locks or atomics, but
with a need for stack fixups. But Node.js seems to get by fine with code
continuations instead of Go’s stack fixups.

A buffered channel is just a fixed size block of memory with alignment,
size, and atomic wrapping read and write pointers.

Why do they need to be atomic?

So that the read thread can acquire the write pointer to see how much data
is available, and release the read pointer so that the write thread can
acquire the read pointer to see how much space is available, and
conversely the write thread acquires the read pointer and releases the write
pointer. And when the write thread updates the write pointer it updates it *after*
writing the data and does a release on the write pointer atomic, so that
when the read thread does an acquire on the write pointer, all the data that
the write pointer says was written will actually be there in the memory that
the read thread is looking at.

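A minimal sketch of that acquire and release handshake, assuming exactly one read thread and one write thread, and a power of two capacity. `SpscRing`, `try_push`, `try_pop` and `ring_demo` are hypothetical names. The "pointers" are ever increasing indices taken modulo the capacity, and a blocked side just yields and retries instead of going to mutex resolution.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
    T m_slots[N];
    alignas(64) std::atomic<std::size_t> m_read{0};    // next slot to read
    alignas(64) std::atomic<std::size_t> m_write{0};   // next slot to write
public:
    // Called by the write thread only.
    bool try_push(const T& value) {
        const std::size_t w = m_write.load(std::memory_order_relaxed);
        // Acquire the read index to see how much space is available.
        const std::size_t r = m_read.load(std::memory_order_acquire);
        if (w - r == N) return false;   // full
        m_slots[w % N] = value;         // write the data first...
        // ...then release the write index, so the data is visible to
        // whoever acquires this index.
        m_write.store(w + 1, std::memory_order_release);
        return true;
    }

    // Called by the read thread only.
    bool try_pop(T& out) {
        const std::size_t r = m_read.load(std::memory_order_relaxed);
        // Acquire the write index to see how much data is available.
        const std::size_t w = m_write.load(std::memory_order_acquire);
        if (r == w) return false;       // empty
        out = m_slots[r % N];
        // Release the read index so the writer may reuse the slot.
        m_read.store(r + 1, std::memory_order_release);
        return true;
    }
};

// Hypothetical demo: a writer thread sends 1..1000, the caller reads them.
long ring_demo() {
    SpscRing<int, 8> ring;
    std::thread writer([&ring] {
        for (int i = 1; i <= 1000; ++i)
            while (!ring.try_push(i)) std::this_thread::yield();
    });
    long sum = 0;
    for (int got = 0; got < 1000;) {
        int value = 0;
        if (ring.try_pop(value)) { sum += value; ++got; }
        else std::this_thread::yield();
    }
    writer.join();
    return sum;
}
```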
Multiple routines can send data into a single channel, and, with select, a
single routine can receive data from any of many channels.

But, with Go style programming, you are apt to have far more routines
than actual hardware threads servicing them, so you are still going to need
to sleep your threads, making atomic channels an optimization of limited
value.

Your input buffer is empty. If you have one thread handling the one
process for that input stream, you are going to have to sleep it. But this is costly.
Better to have continuations that get executed when data is available in the
channel, which means your channels are all piping to one thread, that then
calls the appropriate code continuation. So how is one thread going to do a
select on a thousand channels?

Well, we have a channel full of channels that need to be serviced. And
when that channel empties, mutex.

Trouble is, I have not figured out how to have a thread wait on multiple
channels. The C++ wait function does not implement a select. Well, it
does, but you need a condition statement that looks over all the possible
wake conditions. And it looks like all those wake conditions have to be on
a single mutex, on which there is likely to be a lot of contention.

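A minimal sketch of that single mutex select, assuming just two queues and one condition variable shared between them. `Mailboxes`, `post_number`, `post_text`, `select_one` and `select_demo` are hypothetical names. The wait predicate looks over every wake condition, which is exactly why everything has to hang off the one mutex.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

struct Mailboxes {
    std::mutex mutex;                // the one mutex guarding every queue
    std::condition_variable ready;   // fired when any queue gains an item
    std::deque<int> numbers;
    std::deque<std::string> texts;
};

void post_number(Mailboxes& m, int n) {
    { std::lock_guard<std::mutex> lock(m.mutex); m.numbers.push_back(n); }
    m.ready.notify_one();
}

void post_text(Mailboxes& m, std::string s) {
    { std::lock_guard<std::mutex> lock(m.mutex); m.texts.push_back(std::move(s)); }
    m.ready.notify_one();
}

// Sleeps until either queue has something, then takes one item from it.
std::string select_one(Mailboxes& m) {
    std::unique_lock<std::mutex> lock(m.mutex);
    // The predicate looks over all the possible wake conditions.
    m.ready.wait(lock, [&m] { return !m.numbers.empty() || !m.texts.empty(); });
    if (!m.numbers.empty()) {
        const int n = m.numbers.front();
        m.numbers.pop_front();
        return "number " + std::to_string(n);
    }
    std::string s = std::move(m.texts.front());
    m.texts.pop_front();
    return "text " + s;
}

// Hypothetical demo: two posting threads, the caller selects twice.
// The arrival order is not deterministic, so either order is accepted.
bool select_demo() {
    Mailboxes m;
    std::thread a([&m] { post_number(m, 7); });
    std::thread b([&m] { post_text(m, "hi"); });
    const std::string first = select_one(m);
    const std::string second = select_one(m);
    a.join();
    b.join();
    return (first == "number 7" && second == "text hi") ||
           (first == "text hi" && second == "number 7");
}
```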
It seems that every thread grabs the lock, modifies the data protected by
the lock, performs waits on potentially many condition variables all using
the same lock and protected by the same lock, condition variables that
look at conditions protected by the lock, then releases the lock
immediately after firing the notify.

But it could happen that if we try to avoid unnecessarily grabbing the
mutex, one thread sees the other thread awake, just when it is going to
sleep, so I fear I have missed a spin lock somewhere in this story.

If we want to avoid unnecessary resort to mutex, we have to spin lock on a
state machine that governs entry into mutex resolution. Each thread makes
its decision based on the current state of channel and state machine, and
does an `atomic_compare_exchange_weak_explicit` to amend the state of the
state machine. If the state machine has not changed, the decision goes
through. If the state machine was changed, presumably by the other thread,
it re-evaluates its decision and tries again.

Condition variables are designed to support the case where you have one
thread or a potentially vast pool of threads waiting for work, but are not
really designed to address the case where one thread is waiting for work
from a potentially vast pool of threads, and I rather think I will have to
handcraft a handler for this case from atomics and, ugh, dangerous spin
loops implemented in atomics.

A zero capacity Go channel sort of corresponds to a C++ binary
semaphore. A finite and small Go channel sort of corresponds to a C++
finite and small semaphore. Maybe the solution is semaphores, rather than
atomic variables. But I am just not seeing a match.

I notice that notifications seem to be built out of a critical section, with
far too much grabbing and releasing of a mutex. Under the hood, likely a too-clever
and complicated use of threads piling up on the same critical
section. So maybe we need some spin state atomic state machine system
that drops spinning threads to wait on a semaphore. Each thread on a
channel drops the most recent channel state after reading, and the most recent
state after writing, onto an atomic variable.

But the most general case is many to many, with many processes doing a
select on many channels. We want a thread to sleep if all the channels on
which it is doing a select are blocked on the operation it wants to do, and
we want processes waiting on a channel to keep being woken up, one at a
time, as long as a channel has stuff that processes are waiting on.

# C++ Multithreading

`std::async` is designed to support the case where threads spawn more
threads if there is more work to do, and the pool of threads is not too large,
and threads terminate when they are out of work, or do the work
sequentially if doing it in parallel seems unlikely to yield benefits. C++ by
default manages the decision for you.

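A minimal sketch of leaving that decision to the library, assuming the work is summing two halves of a vector. `sum_halves` is a hypothetical name. Called without a launch policy, `std::async` may run the task on another thread or defer it until `get()` is called, and the result is the same either way.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums the upper half through std::async and the
// lower half on the calling thread.
long sum_halves(const std::vector<int>& v) {
    const auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    // No launch policy given: the implementation chooses between running
    // the task asynchronously and deferring it until get() is called.
    std::future<long> upper = std::async([&v, mid] {
        return std::accumulate(mid, v.end(), 0L);
    });
    const long lower = std::accumulate(v.begin(), mid, 0L);
    return lower + upper.get();   // waits for, or runs, the other half
}
```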
Maybe the solution is to use threads where we need stack state, and
continuations serviced by a single thread where we expect to handle one
and only one reply. Node.js gets by fine on one thread and one database
connection.

```C++
#include <thread>
#include <utility>
static_assert(__STDCPP_THREADS__==1, "Needs threads");
// As thread resources have to be managed, need to be wrapped in
// RAII
class ThreadRAII {
    // Owns the thread. A reference member would dangle, because the
    // thread object is a temporary created in the constructor call.
    std::thread m_thread;
public:
    // As a thread object is moveable but not copyable, the thread obj
    // needs to be constructed inside the invocation of the ThreadRAII
    // constructor.
    explicit ThreadRAII(std::thread && threadObj) : m_thread(std::move(threadObj)){}
    ~ThreadRAII(){
        // Check if thread is joinable then detach the thread
        if(m_thread.joinable()){
            m_thread.detach();
        }
    }
};
```

Examples of thread construction

```C++
// String literals are const, so the parameter has to be const char *
void foo(const char *){
    …
}

class foo_functor
{
public:
    void operator()(const char *){
        …
    }
};


int main(){
    ThreadRAII thread_one(std::thread (foo, "one"));
    ThreadRAII thread_two(
        std::thread (
            (foo_functor()),
            "two"
        )
    );
    const char three[]{"three"};
    ThreadRAII thread_lambda(
        std::thread(
            [three](){
                …
            }
        )
    );
}
```

C++ has a bunch of threading facilities that are designed for the case that
a normal procedural program forks a bunch of tasks to do stuff in parallel,
and then when they are all done, merges the results with join or promise
and future, and then the main program does its thing.

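A minimal sketch of the promise and future half of that pattern, assuming one forked task that hands back a single string. `promise_demo` is a hypothetical name, and the lambda init-capture needs C++14.

```cpp
#include <future>
#include <string>
#include <thread>
#include <utility>

// Hypothetical demo: fork one task, then merge its result back in.
std::string promise_demo() {
    std::promise<std::string> result;
    std::future<std::string> pending = result.get_future();
    // The promise is moved into the worker, which fulfils it exactly once.
    std::thread worker([p = std::move(result)]() mutable {
        p.set_value("done");
    });
    std::string value = pending.get();   // blocks until set_value is called
    worker.join();
    return value;
}
```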
This is not so useful when the main program is event oriented, rather
than procedural.

If the main program is event oriented, then each thread has to stick around
for the duration, and has to have its own event queue, which C++ does not
directly provide.

In this case threads communicate by posting events, and primitives that do
thread synchronization (promise, future, join) are not terribly useful.

A thread grabs its event queue, using the mutex, pops out the next event,
releases the mutex, and does its thing.

If the event queue is empty, then, without releasing it, the thread
processing events waits on a [condition variable](https://thispointer.com//c11-multithreading-part-7-condition-variables-explained/) (which wait releases the
mutex). When another thread grabs the event queue mutex and stuffs
something into the event queue, it fires the [condition variable](https://thispointer.com//c11-multithreading-part-7-condition-variables-explained/), which
wakes up and restores the mutex of the thread that will process the event
queue.

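A minimal sketch of such a thread with its own event queue, assuming events are plain `std::function<void()>` callables. `EventLoop`, `post` and `event_demo` are hypothetical names. The destructor lets the queue drain before joining, which is one choice among several for shutdown.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

class EventLoop {
    std::mutex m_mutex;
    std::condition_variable m_has_event;
    std::deque<std::function<void()>> m_events;
    bool m_stopping = false;
    std::thread m_thread;   // declared last: it uses the members above

    void run() {
        for (;;) {
            std::function<void()> event;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                // wait releases the mutex while asleep and holds it
                // again when it returns.
                m_has_event.wait(lock, [this] {
                    return m_stopping || !m_events.empty();
                });
                if (m_events.empty()) return;   // stopping, queue drained
                event = std::move(m_events.front());
                m_events.pop_front();
            }   // mutex released before the event does its thing
            event();
        }
    }
public:
    EventLoop() : m_thread([this] { run(); }) {}
    EventLoop(const EventLoop&) = delete;
    EventLoop& operator=(const EventLoop&) = delete;
    ~EventLoop() {
        { std::lock_guard<std::mutex> lock(m_mutex); m_stopping = true; }
        m_has_event.notify_one();
        m_thread.join();
    }
    // Called from any other thread to post an event to this one.
    void post(std::function<void()> event) {
        { std::lock_guard<std::mutex> lock(m_mutex); m_events.push_back(std::move(event)); }
        m_has_event.notify_one();
    }
};

// Hypothetical demo: post five events, let the loop drain, read the total.
int event_demo() {
    int total = 0;
    {
        EventLoop loop;
        for (int i = 1; i <= 5; ++i)
            loop.post([&total, i] { total += i; });
    }   // destructor drains the queue and joins the thread
    return total;
}
```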
Mutexes need to be managed by RAII objects, one of which we will use in
constructing the condition object.