---
title: C++ Multithreading
...
Computers have to handle many different things at once, for example
screen, keyboard, drives, database, internet.
These are best represented as communicating concurrent processes, with
channels, as in Go routines. Even algorithms that are not really handling
many things at once, but are doing a single thing, such as everyone's
sample program, the sieve of Eratosthenes, are cleanly represented as
communicating concurrent processes with channels.
[asynch await]:../client_server.html#the-equivalent-of-raii-in-event-oriented-code
On the other hand, they can also be represented, not quite so cleanly, by [asynch await], which makes for much lighter weight code that interfaces more cleanly with C++.
Concurrency is not the same thing as parallelism.
A node.js program is typically thousands of communicating concurrent
processes, with absolutely no parallelism, in the sense that node.js is single
threaded, but a node.js program typically has an enormous number of code
continuations, each of which is in effect the state of a concurrent
communicating process. Lightweight threads as in Go are threads that on
hitting a pause get their stack state stashed into an event handler and
executed by event oriented code, so one can always accomplish the same
effect more efficiently by writing directly in event oriented code.
And it is frequently the case that when you cleverly implement many
concurrent processes with more than one thread of execution, so that some
of your many concurrent processes are executed in parallel, your program
runs slower, rather than faster.
C++ multithreading is written around a way of coding that in practice does
not seem all that useful: parallel bitbashing. The idea is that you are
doing one thing, but dividing that one thing up between several threads to get
more bits bashed per second, the archetypical example being a for loop
performed in parallel, and then all the threads join after the loop is
complete.
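
The archetypical fork and join pattern can be sketched roughly as follows. This is only an illustration: `parallel_sum` and its parameters are hypothetical names, and the work is split into equal contiguous slices on the assumption that the per-item cost is uniform.

```C++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` by giving each of `nthreads` threads a
// contiguous slice, then joining them all before combining the results.
long long parallel_sum(const std::vector<int>& data, unsigned nthreads)
{
    assert(nthreads > 0);
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            // Each thread writes only its own slot, so no locking is needed.
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
        });
    }
    for (auto& w : workers) w.join();   // all the threads join after the loop
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(v, 4)` on a vector holding 1 to 1000 returns 500500. Nothing here helps with a thousand idle connections, which is the point of the paragraphs that follow.
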
The normal case however is that you want to manage a thousand things at
once, for example a thousand connections to the server. You are not
worried about how many millions of floating point operations per second,
but you are worried about processes sitting around doing nothing while
waiting for network or disk operations to complete.
For this, you need concurrent communicating processes, as in Go, or event
orientation, as in node.js or nginx, not necessarily the parallelism that
C++ threads are designed around.
The need to deal with many peers and a potentially enormous number of
clients suggests multiprocessing in the style of Go and node.js, rather than
what C++ multiprocessing is designed around. It suggests a very large
number of processes that are concurrent, but not all that parallel, rather
than a small number of processes that are concurrent and also substantially
parallel. Representing a process by a thread runs into trouble at around
sixty four threads.
It is probably efficient to represent interactions between peers as threads,
but client/peer are going to need either events or Go lightweight threads,
and client/client interactions are going to need events.
Existing operating systems run far more than sixty four threads, but this
only works because the threads are grouped into processes, and most of those
processes are inactive. If you have more than sixty four concurrently active threads in an
active process, with the intent that half a dozen or so of those active
concurrent threads will be actually executing in parallel, as for example a
browser with a thread for each tab, and sixty four tabs, that active process
is likely to be not very active.
Thus scaling Apache, whether as threads on windows or processes under
Linux, is apt to die.
# Need the solutions implemented by Tokio, Actix, Node.js and Go
Not the solutions supplied by the C++ libraries, because we are worrying
about servers, not massive bit bashing.
Go routines and channels can cleanly express both the kind of problems
that node.js addresses, and also address the kind of problem that C++
threads address, typically that you divide a task into a dozen subtasks, and
then wait for them all to complete before you take the next step, which are
hard to express as node.js continuations. Goroutines are a more flexible
and general solution, that make it easier to express a wider range of
algorithms concisely and transparently, but I am not seeing any mass rush
from node.js to Go. Most of the time, it is easy enough to write in code
continuations inside an event handler.
The general concurrent task that Google's massively distributed database
is intended to express is that you have a thousand tasks each of which
generate a thousand outputs, which get sorted, and each of the enormous
number of items that sort into the same equivalence group gets aggregated
in a commutative operation, which can therefore be handled by any
number of processes in any order, and possibly the entire sort sequence
gets aggregated in an associative operation, which can therefore be
handled by any number of processes in any order.
The magic in the Google massively parallel database is that one can define
a massively parallel operation on a large number of items in a database
simultaneously, much as one defines a join in SQL, and one can define
another massively parallel operation as commutative and/or associative
operations on the sorted output of such a massively parallel operation. But
we are not much interested in this capability, though something
resembling it is going to be needed when we have to shard.
# doing node.js in C++
Dumb idea. We already have the node.js solution in a Rust library.
Actix and Tokio are the (somewhat Cish) solutions. But Rust async is infamously
hard. The borrow checker goes mad trying to figure out lifetimes in async code.
## callbacks
In C, a callback is implemented as an ordinary function pointer, and a pointer to void,
which is then cast to a data structure of the desired type.
What the heavy C++ machinery of `std::function` does is bind the two together and then
do memory management after the fashion of `std::string`.
(But we probably need to do our own memory management, so we need to write
our own equivalent of `std::function`, supporting a C rather than C++ API.)
[compiler explorer]:https://godbolt.org/ {target="_blank"}
And `std::function`, used correctly, should compile to the identical code
merely wrapping the function pointer and the void pointer in a single struct
-- but you had better use [compiler explorer] to make sure
that you are using it correctly.
Write a callback in C, and an `std::function` in C++, and make sure that
the compiler generates what it should.
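
A minimal pair to paste into compiler explorer might look like this. All the names (`c_callback`, `counter`, `bump`, `call_c`, `call_cpp`) are made up for the experiment, and whether the two call paths really compile to the same code is exactly what remains to be checked.

```C++
#include <cassert>
#include <functional>

// C style: an ordinary function pointer, plus a pointer to void.
extern "C" typedef void (*c_callback)(void* context);

struct counter { int hits = 0; };

extern "C" void bump(void* context)
{
    assert(context != nullptr);
    // The callback casts the void pointer back to the type it expects.
    static_cast<counter*>(context)->hits += 1;
}

void call_c(c_callback fn, void* context) { fn(context); }

// C++ style: std::function binds the code and the data together.
void call_cpp(const std::function<void()>& fn) { fn(); }

int compare_callbacks()
{
    counter c;
    call_c(bump, &c);               // function pointer and void pointer, separately
    call_cpp([&c] { bump(&c); });   // the same call wrapped in a lambda
    return c.hits;
}
```

`compare_callbacks()` returns 2, one hit per calling style. The interesting part is the generated assembly for `call_c` and `call_cpp` at `-O2`, not the result.
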
Ownership is going to be complicated. After creating and passing a callback, we
probably do not want ownership any more, since the thread is going to return
and be applied to some entirely different task. So the call that is passed the callback
as an argument by reference uses `move`, to ensure that when the `std::function`
stack value in its caller pointing to the heap gets destroyed, it does not
free the value on the heap, and then stashes the moved `std::function` in some
safe place.
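
The hand over can be sketched like this. `callback_store` is a hypothetical stand-in for the safe place, and the captured vector is there only to give the callback some owned state that has to survive the caller's stack frame.

```C++
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the event machinery's "safe place".
class callback_store {
    std::vector<std::function<int()>> m_pending;
public:
    // Takes the callback by reference and moves from it, so the caller's
    // object no longer owns the captured state.
    void post(std::function<int()>& cb)
    {
        assert(static_cast<bool>(cb));   // refuse empty callbacks
        m_pending.push_back(std::move(cb));
    }
    // Runs and discards every stored callback, returning the sum of results.
    int run_all()
    {
        int total = 0;
        for (auto& cb : m_pending) total += cb();
        m_pending.clear();
        return total;
    }
};

int post_and_run()
{
    callback_store store;
    {
        std::vector<int> big(1000, 1);   // state owned by the callback
        std::function<int()> cb = [big] { return static_cast<int>(big.size()); };
        store.post(cb);
        // cb is now moved from: valid, but no longer owning the capture.
    }   // destroying the moved-from cb does not free the captured vector
    return store.run_all();
}
```

`post_and_run()` returns 1000, showing that the captured state outlived the stack value that originally held it.
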
Another issue is that Rust, Python, and all the rest cannot talk to C++, they can only
talk C. On the other hand, the compiler will probably optimise the `std::function` that
consists of a lambda that is a call to a function pointer and that captures a pointer to void.
Again, since the compiler is designed for arcane optimization issues, we have to see what happens
in [compiler explorer].
But rather than guessing about the compiler correctly guessing intent, make the
callback a C union type implementing `std::variant` in C, being a union of `std::monostate`,
a C callback taking no arguments, a C++ callback taking no arguments, C and C++ callbacks
taking a void pointer argument, a C++ callback that is a pointer to method, and
a C++ callback that is an `std::function`.
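
As a rough picture of the alternatives, here is the C++ side only, using `std::variant` itself rather than the C union it would eventually have to become. The type names (`c_plain`, `c_with_ctx`, `method_call`, `widget`) are invented for the sketch, and the C and C++ function pointer flavours are collapsed into one alternative each.

```C++
#include <cassert>
#include <functional>
#include <variant>

struct widget {
    int value = 0;
    void poke() { value += 100; }
};

using c_plain    = void (*)();        // callback taking no arguments
using c_with_ctx = void (*)(void*);   // callback taking a void pointer
struct method_call { widget* self; void (widget::*fn)(); };

using callback = std::variant<std::monostate, c_plain, c_with_ctx,
                              method_call, std::function<void()>>;

// Visitor that knows how to call each alternative.
struct invoker {
    void* ctx;
    void operator()(std::monostate) const {}   // nothing to call
    void operator()(c_plain f) const { f(); }
    void operator()(c_with_ctx f) const { f(ctx); }
    void operator()(const method_call& m) const
    {
        assert(m.self != nullptr);
        (m.self->*m.fn)();
    }
    void operator()(const std::function<void()>& f) const { f(); }
};

void invoke(const callback& cb, void* ctx) { std::visit(invoker{ctx}, cb); }

void add_one(void* p) { *static_cast<int*>(p) += 1; }

int demo_variant()
{
    int n = 0;
    widget w;
    callback nothing;                                    // std::monostate
    callback with_ctx = c_with_ctx{add_one};
    callback via_method = method_call{&w, &widget::poke};
    callback via_function = std::function<void()>{[&n] { n += 10; }};
    for (const callback* cb : {&nothing, &with_ctx, &via_method, &via_function})
        invoke(*cb, &n);
    return n + w.value;
}
```

`demo_variant()` returns 111: one from the void pointer callback, ten from the `std::function`, and a hundred from the pointer to method. A real C compatible version would replace `std::variant` with a tag and a union, and could not hold the `std::function` by value.
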
In the old win32 apis, which were designed for C, and then retrofitted for C++
they would have a static member function that took an LPARAM, which was a pointer
to void pointing at the actual object, and then the static member function
would directly call the appropriate, usually virtual, actual member function.
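
The pattern looks roughly like this. `fire` stands in for whatever C style API does the calling, and `window_base`, `thunk`, and `on_message` are illustrative names, not the real win32 ones.

```C++
#include <cassert>

// Hypothetical C style API: it knows only a function pointer and a void pointer.
typedef int (*c_handler)(void* lparam, int message);

int fire(c_handler handler, void* lparam, int message)
{
    return handler(lparam, message);
}

class window_base {
public:
    virtual ~window_base() = default;
    // A static member function has no `this`, so a C API can hold a pointer to it.
    static int thunk(void* lparam, int message)
    {
        assert(lparam != nullptr);
        // Recover the object, then call the virtual member function.
        return static_cast<window_base*>(lparam)->on_message(message);
    }
protected:
    virtual int on_message(int message) = 0;
};

class doubler : public window_base {
protected:
    int on_message(int message) override { return 2 * message; }
};

int demo_thunk()
{
    doubler d;
    // Convert to the base pointer first, so the cast back in thunk matches.
    return fire(&window_base::thunk, static_cast<window_base*>(&d), 21);
}
```

`demo_thunk()` returns 42. The one thing to get right is that the pointer is converted to `window_base*` before it becomes a `void*`, because the thunk casts it back to exactly that type.
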
Member function pointers have syntax that no one can wrap their brains around
so people wrap them in layers of typedefs.
Sometimes you want to have indefinitely many data structures, which are dynamically allocated
and then discarded.
Sometimes you want to have a single data structure that gets overwritten frequently. The latter is
preferable when it suffices, since it means that asynch callback code is more like sync code.
In one case, you would allocate the object every time, and when done with it, discard it.
In the other case it would be a member variable of struct that hangs around and is continually
re-used.
### C compatibility
Bind the two together in a way that C can understand:
The code that calls the callback knows nothing about how the blob is structured.
The event management code knows nothing about how the blob is structured.
But the function pointer in the blob *does* know how the blob is structured.
```C++
// pp points to or into a blob of data containing a pointer to the callback,
// and the data that the callback needs is at a position relative to the pointer
// that is known to the callback function.
enum event_type { monovalue, reply, timeout, unreachable, unexpected_error };
struct ReplyTo;
// The callback type has C linkage, so that C, Rust, or Lua glue can supply it.
extern "C" typedef void (*ReplyTo_)(ReplyTo* pp, void* RegardsTo, event_type evtype, void* event);
struct ReplyTo
{
    ReplyTo_ p;
};
// Within the actual function in the event handling code,
// one has to cast `ReplyTo* pp` from its base type to its actual type
// that has the rest of the data, which the event despatcher code knows nothing of.
// The event despatch code should not include the headers of the event handling code,
// as this would make possible breach of separation of responsibilities.
// (`despatch` is an illustrative name for the call site in the despatch code.)
void despatch(ReplyTo* pp, void* py, event_type evtype, void* event)
{
    try {
        (*((*pp).p))(pp, py, evtype, event);
    }
    catch (...) {
        // log error and release event handler.
        // if an exception propagates from the event handling code into the event despatch code
        // it is programming error, a violation of separation of responsibilities
        // and the event despatch code cannot do anything with the error.
    }
}
```
`pp` points into a blob that contains the data needed for handling the event when it happens,
and a pointer to the code that will handle it when it happens. `event` is a pointer to a
struct containing the event.
But, that C code will be quite happy if given a class whose first field is a pointer to a C calling
convention static member function that calls the next field, the
next field being a lambda whose unnameable type is known to the templated object when
it was defined, or if it is given a class whose first field is a pointer to a C calling convention
static function that does any esoteric C++, or Rust, or Lua thing.
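
One way to sketch such a class is shown below. `ReplyToLambda`, `make_reply_to`, and `despatch` are hypothetical names. The typedef is left without `extern "C"` because a static member function cannot be declared with C linkage, on the assumption (true of the mainstream compilers as far as I know) that both get the same calling convention.

```C++
#include <cassert>
#include <utility>

enum event_type { monovalue, reply, timeout, unreachable, unexpected_error };
struct ReplyTo;
typedef void (*ReplyTo_)(ReplyTo* pp, void* RegardsTo, event_type evtype, void* event);
struct ReplyTo { ReplyTo_ p; };

// The function pointer comes first (as the base), followed by a lambda
// whose unnameable type is known only to this template.
template <typename F>
struct ReplyToLambda : ReplyTo {
    F handler;
    explicit ReplyToLambda(F f) : ReplyTo{&thunk}, handler(std::move(f)) {}
    static void thunk(ReplyTo* pp, void* regards, event_type evtype, void* event)
    {
        // Only the thunk knows the real type behind the base pointer.
        static_cast<ReplyToLambda*>(pp)->handler(regards, evtype, event);
    }
};

template <typename F>
ReplyToLambda<F> make_reply_to(F f) { return ReplyToLambda<F>(std::move(f)); }

// What the despatcher does: it sees only a ReplyTo*.
void despatch(ReplyTo* pp, void* regards, event_type evtype, void* event)
{
    assert(pp != nullptr && pp->p != nullptr);
    (*pp->p)(pp, regards, evtype, event);
}

int demo_reply_to()
{
    int seen = 0;
    auto handler = make_reply_to([&seen](void*, event_type evtype, void* event) {
        if (evtype == reply) seen = *static_cast<int*>(event);
    });
    int payload = 7;
    despatch(&handler, nullptr, reply, &payload);
    return seen;
}
```

`demo_reply_to()` returns 7. The despatcher never sees the lambda's type, which is the separation of responsibilities the C code above is after.
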
The runtime despatches an event to an object of type `ReplyTo` once and only
once and then it is freed. Thus if for example the object is waiting for a packet that has a handle
to it, or a timeout, and two such packets arrive it is only called with the
first such packet, the next packet is silently discarded, and the timeout
event cancelled or ignored.
The object of type RegardsTo has a longer lifetime, which the runtime
does not manage. The runtime ensures that if two events reference
the same RegardsTo object, they are handled serially, except for
the case that the RegardsTo object is a nullpointer.
The next event referencing the same RegardsTo object goes into a queue
waiting on completion of the previous message.
If an event has a RegardsTo, but no ReplyTo, it goes to the static default ReplyTo handler of the RegardsTo object, which gets called many times
and whose lifetime is not managed by the runtime.
If a message should result in changes to many RegardsTo objects,
one of them has to handle it, and then send messages to the others.
Code called by the async runtime must refrain from updating or
reading data that could be changed by other code called by the
asynch runtime unless the data is atomically changed from one
valid state to another valid state. (For example a pointer pointing
to the previous valid state is atomically updated to a pointer to a newly
created valid state.)
In the normal case the ReplyTo callback is sticking to data that
is in its ReplyTo and RegardsTo object.
When a RegardsTo object tells the runtime it has finalized,
the runtime will no longer do callbacks referencing it.
The finalization is itself an event, and results in a callback
whose event data is a pointer to the finalized object and whose
ReplyTo and RegardsTo objects are the objects that created the
now finalized object.
Thus one RegardsTo object can spawn many such objects, and
can read their contents when they finalise. (But not until they
finalise)
## Use Go
Throw up hands in despair, and provide an interface linking Go to secure
Zooko ids, similar to the existing interface linking it to Quic and SSL.
This solution has the substantial advantage that it would then be relatively
easy to drop in the existing social networking software written in Go, such
as Gitea.
We probably don't want Go to start managing C++ spawned threads, but
the Go documentation seems to claim that when a Go heavyweight thread
gets stuck at a C mutex while executing C code, Go just spawns another to
deal with the lightweight threads when the lightweight threads start piling
up.
When a C++ thread wants to despatch an event to Go, it calls a Go routine
with a select and a default, so that the Go routine will never attempt to
pause the C++ spawned thread on the assumption that it is a Go spawned
thread. But it would likely be safer to call Goroutines on a thread that was
originally spawned by Go.
## doing it in C the C way
Processes represented as threads. Channels have a mutex. A thread grabs
total exclusive ownership of a channel whenever it takes something out or
puts something in. If a channel is empty or full, it then waits on a
condition on the mutex, and when the other thread grabs the mutex and
makes the channel ready, it notices that the other process or processes are
waiting on condition, the condition is now fulfilled, and sends a
notify_one.
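
A plain mutex and condition variable channel along these lines might look like the following. It is a sketch: `channel`, `send`, and `receive` are invented names, the payload is fixed at `int`, and the notify is sent unconditionally rather than only when a waiter has been noticed.

```C++
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Hypothetical bounded channel of ints guarded by one mutex.
class channel {
    std::mutex m_mutex;
    std::condition_variable m_not_empty;
    std::condition_variable m_not_full;
    std::deque<int> m_items;
    const std::size_t m_capacity;
public:
    explicit channel(std::size_t capacity) : m_capacity(capacity)
    {
        assert(capacity > 0);
    }
    void send(int value)
    {
        std::unique_lock<std::mutex> lock(m_mutex);   // exclusive ownership
        // Full channel: wait on the condition, which releases the mutex.
        m_not_full.wait(lock, [this] { return m_items.size() < m_capacity; });
        m_items.push_back(value);
        lock.unlock();
        m_not_empty.notify_one();   // the channel is now ready for a reader
    }
    int receive()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Empty channel: wait until a sender makes it ready.
        m_not_empty.wait(lock, [this] { return !m_items.empty(); });
        const int value = m_items.front();
        m_items.pop_front();
        lock.unlock();
        m_not_full.notify_one();    // there is now room for a writer
        return value;
    }
};

int demo_channel()
{
    channel ch(2);
    std::thread producer([&ch] { for (int i = 1; i <= 100; ++i) ch.send(i); });
    int total = 0;
    for (int i = 0; i < 100; ++i) total += ch.receive();
    producer.join();
    return total;
}
```

`demo_channel()` returns 5050, the sum of 1 to 100, with the producer repeatedly blocking on the two slot buffer.
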
Or, when the channel is neither empty nor full, we have an atomic spin lock,
and when sleeping might become necessary, then we go to full mutex resolution.
Which implies a whole pile of data global to all threads, which will have
to be atomically changed.
This can be done by giving each thread two buffers for this global data
subject to atomic operations, and a single pointer or index that points to the
currently ruling global data set. (The mutex is also of course global, but
the flag saying whether to use atomics or mutex is located in a data
structure managed by atomics.)
When a thread wants to atomically update a large object (which should be
sixty four byte aligned) it constructs a copy of the current object, and
atomically updates the pointer to the copy, if the pointer was not changed
while it was constructing. The object is immutable while being pointed at.
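
The copy then swap the pointer update can be sketched as below. `settings` and `raise_limit` are made up for the example, and the hard part, deciding when an old version can safely be freed, is dodged by handing retired versions back to the caller, who frees them only after every thread has finished.

```C++
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

struct alignas(64) settings {   // sixty four byte aligned, as suggested above
    int generation = 0;
    int limit = 0;
};

// Builds a modified copy and publishes it, retrying if the pointer changed
// while the copy was being constructed. Returns the version it replaced.
const settings* raise_limit(std::atomic<const settings*>& current, int delta)
{
    const settings* seen = current.load(std::memory_order_acquire);
    for (;;) {
        assert(seen != nullptr);
        settings* copy = new settings(*seen);   // never mutate the published object
        copy->generation += 1;
        copy->limit += delta;
        if (current.compare_exchange_weak(seen, copy,
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            return seen;   // retired version, not yet safe to free
        }
        delete copy;       // `seen` now holds the newer pointer, so try again
    }
}

int demo_publish()
{
    std::atomic<const settings*> current{new settings};
    std::vector<const settings*> retired[2];
    auto work = [&current](std::vector<const settings*>& mine) {
        for (int i = 0; i < 1000; ++i) mine.push_back(raise_limit(current, 1));
    };
    std::thread a(work, std::ref(retired[0]));
    std::thread b(work, std::ref(retired[1]));
    a.join();
    b.join();
    const settings* last = current.load();
    const int limit = last->limit;
    for (auto& list : retired)
        for (const settings* s : list) delete s;   // safe now: no readers remain
    delete last;
    return limit;
}
```

`demo_publish()` returns 2000, since no increment is lost however the two threads interleave. Because retired versions are not freed while threads are running, an address is never reused under a thread that still holds it.
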
Or we could have two such objects, with the thread spinning if one is in
use and the other already grabbed, or momentarily sleeping if an atomic
count indicates other threads are spinning on a switch awaiting
completion.
The read thread, having read, stores its read pointer atomically with
`memory_order_release`, ored with the flag saying if it is going to full
mutex resolution. It then reads the write pointer with
`memory_order_acquire`, that the write thread atomically wrote with
`memory_order_release`, and if all is well, keeps on reading, and if it is
blocked, or the write thread has gone to mutex resolution, sets its mutex
resolution flag and proceeds to mutex resolution. When it is coming out of
mutex resolution, about to release the mutex, it clears its mutex resolution
flag. The mutex is near the flags by memory location, all part of one object
that contains a mutex and atomic variables.
So the mutex flag is atomically set when the mutex has not yet been
acquired, but the thread is unconditionally going to acquire it, but non
atomically cleared when the mutex still belongs to the thread, but is
unconditionally going to release it.
If many read threads are reading from one channel, then each thread has to
`memory_order_acquire` the read pointer, and then, instead of
`memory_order_release`ing it, has to do an
`atomic_compare_exchange_weak_explicit`, and if it changed while it was
reading, abort its reads and start over.
Similarly if many write threads are writing to one channel, each write thread
will first have to spin lock acquire the privilege of being the sole write thread
writing, or spin lock acquire a range to write to. Thus in the most general
case, we have a spin locked atomic write state that specifies an area that
has been written to, an area that is being written to, and an area that is
available to be acquired for writing, a spin locked atomic read state, and a
mutex that holds both the write state and the read state. In the case of a
vector buffer with multiple writers, the atomic states are three wrapping
atomic pointers that go through the buffer in the same direction.
We would like to use direct memory addresses, rather than vector or deque
addresses, which might require us to write our own vector or deque. See
the [thread safe deque](https://codereview.stackexchange.com/questions/238347/a-simple-thread-safe-deque-in-c "A simple thread-safe Deque in C++"), which however relies entirely on locks and mutexes,
and whose extension to atomic locks is not obvious.
Suppose you are doing atomic operations, but some operations might be
expensive and lengthy. You really only want to spin lock on amending data
that is small and all close together in memory, so on your second spin,
the lock has likely been released.
Well, if you might need to sleep a thread, you need a regular mutex, but
how are you going to interface spin locks and regular mutexes?
You could cleverly do it with notifies, but I suspect it is costly compared
to just using a plain old vanilla mutex. Instead you have some data
protected by atomic locks, and some data protected by regular old
mutexes, and any time the data protected by the regular old mutex might
change, you atomically flag a change coming up, and every thread then
grabs the mutex in order to amend or even look at the data, until on
coming out of the mutex with the data, they see the flag saying the mutex
protected data might change is now clear.
After one has flagged the change coming up, and grabbed the mutex, what
happens if another thread is cheerfully amending the data in a fast
operation, having started before you grabbed the mutex? The other thread
has to be able to back out of that, and then try again, this try likely to be
with mutex resolution. But what if the other thread wants to write into a
great big vector, and reallocations of the vector are mutex protected? And
we want atomic operations so that not everyone has to grab the mutex every
time.
Well, any time you want to do something to the vector, it fits or it does not.
And if it does not fit, then mutex time. You want all threads to switch
to mutex resolution, before any thread actually goes to work reallocating
the vector. So you are going to have to use the costly notify pattern. “I am
out of space, so going to sleep until I can use the mutex to amend the
vector. Wake me up when last thread using atomics has stopped using
atomics that directly reference memory, and has switched to reading the
mutex protected data, so that I can change the mutex protected data.”
The `std::vector` documentation says that vector access is just as efficient as
array access, but I am a little puzzled by this claim, as a vector can be
moved, and specifically wants its elements to have a no throw move operation for
optimization, and having no copy operation is standard where it contains things that
might have ownership. (Which leads to complications when one has containers
of containers, since C++ is apt to helpfully generate a broken copy
implementation.)
Which would suggest that vector access is through indirection, and
indirection with threading creates problems.
## lightweight threads in C
A lightweight thread is just a thread where, whenever a lightweight thread
needs to be paused by its heavyweight thread, the heavyweight thread
stores the current stack state in the heap, and moves on to deal with other
lightweight threads that need to be taken care of. The collection of
preserved lightweight thread stack states amounts to a pile of event
handlers that are awaiting events, and having received events, are then
waiting for a heavyweight thread to process that event handler.
Thus one winds up with what I suspect is the Tokio solution, a stack that
is a tree, rather than a stack.
Hence the equivalence between node.js and nginx event oriented
programming, and Go concurrent programming.
# costs
Windows 10 has no hard limit of sixty four threads, but it behaves as if it
were limited to about that many. If you attempt to create
more threads than that, it still works, but performance is apt to bite, with
arbitrary and artificial thread blocking. Hence goroutines, that implement
unofficial threads inside the official threads.
Thread creation and destruction is fast, five to twenty microseconds, so
thread pools do not buy you much, except that your memory is already
going to be cached. Another source says 40 microseconds on windows,
and fifty kilobytes per thread. So, a gigabyte of ram could have twenty
thousand threads hanging around. Except that the windows thread
scheduler dies on its ass.
There is a reasonable discussion of thread costs [here](https://news.ycombinator.com/item?id=22456642)
General message is that lots of languages have done it better, often
immensely better, Go among them.
Checking the C++ threading libraries, they all single mindedly focus on
the particular goal of parallelizing computationally intensive work. Which
is not in fact terribly useful for anything you are interested in doing.
# Atomics
```C++
typedef enum memory_order {
memory_order_relaxed, // relaxed
memory_order_consume, // consume
/* No one, least of all compiler writers, understands what
"consume" does.
It has consequences which are difficult to understand or predict,
and which are apt to be inconsistent between architectures,
libraries, and compilers. */
memory_order_acquire, // acquire
memory_order_release, // release
memory_order_acq_rel, // acquire/release
memory_order_seq_cst // sequentially consistent
/* "sequentially consistent" interacts with the more commonly
used acquire and release in ways difficult to understand or
predict, and in ways that compiler and library writers
disagree on. */
} memory_order;
```
I don't think I understand how to use atomics correctly.
`atomic_compare_exchange_weak_explicit` inside a while loop is
a spin lock, and spin locks are complicated, apt to be inefficient,
potentially catastrophic, and avoiding catastrophe is subtle and complex.
To cleanly express a concurrent algorithm you need a thousand
communicating processes, as goroutines or node.js continuations, nearly
all of which are sitting around waiting for another thing to send them
a message or be ready to receive their message, while atomics give you a
fixed small number of threads all barreling full speed ahead. Whereupon
you find yourself using spin locks.
Rather than moving data between threads, you need to move threads between
data, between one continuation and the next.
Well, if you have a process that interacts with Sqlite, each thread has to
have its own database connection, in which case it needs to be a pool of
threads. Maybe you have a pool of database threads that do work received
from a bunch of asynch tasks through a single fixed size fifo queue, and
send the results back through another fifo queue, with threads waking up
when the queue gets more stuff in it, and going to sleep when the queue
empties, with the last thread signalling “wake me up when there is
something to do”, and pushback happening when the buffer is full.
Go demonstrates that you can cleanly express algorithms as concurrent
communicating processes using fixed size channels. An unbuffered
channel is just a coprocess, with a single thread of execution switching
between the two coprocesses, without any need for locks or atomics, but
with a need for stack fixups. But Node.js seems to get by fine with code
continuations instead of Go's stack fixups.
A buffered channel is just a fixed size block of memory with alignment,
size, and atomic wrapping read and write pointers.
Why do they need to be atomic?
So that the read thread can acquire the write pointer to see how much data
is available, and release the read pointer so that the write thread can
acquire the read pointer to see how much space is available, and
conversely the write thread acquires the read pointer and releases the write
pointer. And when the write thread updates the write pointer it updates it *after*
writing the data and does a release on the write pointer atomic, so that
when the read thread does an acquire on the write pointer, all the data that
the write pointer says was written will actually be there in the memory that
the read thread is looking at.
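
For exactly one reader and one writer, the acquire and release dance can be sketched like this. `spsc_ring` is a hypothetical name, the pointers are kept as ever increasing counters that wrap by taking the remainder, and the capacity is assumed to be a power of two so that the wrap stays consistent.

```C++
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical single reader, single writer buffered channel.
template <typename T, std::size_t N>
class spsc_ring {
    static_assert(N > 0 && (N & (N - 1)) == 0, "capacity must be a power of two");
    T m_slots[N];
    std::atomic<std::size_t> m_read{0};    // total items consumed
    std::atomic<std::size_t> m_write{0};   // total items produced
public:
    bool try_push(const T& value)
    {
        const std::size_t w = m_write.load(std::memory_order_relaxed); // ours alone
        const std::size_t r = m_read.load(std::memory_order_acquire);  // how much space?
        if (w - r == N) return false;                      // full
        m_slots[w % N] = value;                            // write the data first
        m_write.store(w + 1, std::memory_order_release);   // then publish it
        return true;
    }
    bool try_pop(T& out)
    {
        const std::size_t r = m_read.load(std::memory_order_relaxed);  // ours alone
        const std::size_t w = m_write.load(std::memory_order_acquire); // how much data?
        if (w == r) return false;                          // empty
        out = m_slots[r % N];                              // data is visible after the acquire
        m_read.store(r + 1, std::memory_order_release);    // hand the slot back
        return true;
    }
};

long long demo_ring()
{
    spsc_ring<int, 8> ring;
    std::thread writer([&ring] {
        for (int i = 1; i <= 10000; ++i)
            while (!ring.try_push(i)) std::this_thread::yield();
    });
    long long total = 0;
    for (int received = 0; received < 10000;) {
        int v = 0;
        if (ring.try_pop(v)) { total += v; ++received; }
        else std::this_thread::yield();
    }
    writer.join();
    return total;
}
```

`demo_ring()` returns 50005000, the sum of 1 to 10000. Note that both ends spin with `yield` when blocked, which is the very problem the following paragraphs worry about.
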
Multiple routines can send data into a single channel, and, with select, a
single routine can receive data from any of several channels.
But, with go style programming, you are apt to have far more routines
than actual hardware threads servicing them, so you are still going to need
to sleep your threads, making atomic channels an optimization of limited
value.
Your input buffer is empty. If you have one thread handling the one
process for that input stream, going to have to sleep it. But this is costly.
Better to have continuations that get executed when data is available in the
channel, which means your channels are all piping to one thread, that then
calls the appropriate code continuation. So how is one thread going to do a
select on a thousand channels?
Well, we have a channel full of channels that need to be serviced. And
when that channel empties, mutex.
Trouble is, I have not figured out how to have a thread wait on multiple
channels. The C++ wait function does not implement a select. Well, it
does, but you need a condition statement that looks over all the possible
wake conditions. And it looks like all those wake conditions have to be on
a single mutex, on which there is likely to be a lot of contention.
It seems that every thread grabs the lock, modifies the data protected by
the lock, performs waits on potentially many condition variables all using
the same lock and protected by the same lock, condition variables that
look at conditions protected by the lock, then releases the lock
immediately after firing the notify.
But it could happen that if we try to avoid unnecessarily grabbing the
mutex, one thread sees the other thread awake, just when it is going to
sleep, so I fear I have missed a spin lock somewhere in this story.
If we want to avoid unnecessary resort to mutex, we have to spin lock on a
state machine that governs entry into mutex resolution. Each thread makes
its decision based on the current state of channel and state machine, and
does an `atomic_compare_exchange_weak_explicit` to amend the state of the
state machine. If the state machine has not changed, the decision goes
through. If the state machine was changed, presumably by the other thread,
it re-evaluates its decision and tries again.
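
The decide, compare and exchange, retry loop might look like this in its barest form. The three states and the `decide` rule are pure invention for the sake of having something to transition between. Only the shape of the loop is the point.

```C++
#include <atomic>
#include <cassert>

// Invented states for a channel's entry into mutex resolution.
enum channel_state : int { fast_path, switch_requested, mutex_path };

// Invented decision rule: what this thread wants the state to become,
// given what it last saw and whether it would have to sleep.
channel_state decide(channel_state seen, bool must_sleep)
{
    if (seen == fast_path && must_sleep) return switch_requested;
    if (seen == switch_requested) return mutex_path;
    return seen;
}

// Applies one decision. If another thread changed the state in the
// meantime, the decision is re-evaluated against the new state.
channel_state step(std::atomic<int>& state, bool must_sleep)
{
    int seen = state.load(std::memory_order_acquire);
    for (;;) {
        assert(seen >= fast_path && seen <= mutex_path);
        const channel_state wanted =
            decide(static_cast<channel_state>(seen), must_sleep);
        if (std::atomic_compare_exchange_weak_explicit(
                &state, &seen, static_cast<int>(wanted),
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            return wanted;   // state was unchanged, so the decision goes through
        }
        // `seen` now holds what the other thread installed: decide again.
    }
}
```

Starting from `fast_path`, `step(state, true)` moves the state to `switch_requested`, and the next `step` by any thread moves it on to `mutex_path`. The weak exchange can also fail spuriously, which the loop treats the same as a real change.
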
Condition variables are designed to support the case where you have one
thread or a potentially vast pool of threads waiting for work, but are not
really designed to address the case where one thread is waiting for work
from a potentially vast pool of threads, and I rather think I will have to
handcraft a handler for this case from atomics and, ugh, dangerous spin
loops implemented in atomics.
A zero capacity Go channel sort of corresponds to a C++ binary
semaphore. A finite and small Go channel sort of corresponds to C++
finite and small semaphore. Maybe the solution is semaphores, rather than
atomic variables. But I am just not seeing a match.
I notice that notifications seem to be built out of a critical section, with
far too much grabbing a mutex and releasing a mutex. Under the hood, likely a too-clever
and complicated use of threads piling up on the same critical
section. So maybe we need some spin state atomic state machine system
that drops spinning threads to wait on a semaphore. Each thread on a
channel drops the most recent channel state after reading, and the most recent
state after writing, onto an atomic variable.
But the most general case is many to many, with many processes doing a
select on many channels. We want a thread to sleep if all the channels on
which it is doing a select are blocked on the operation it wants to do, and
we want processes waiting on a channel to keep being woken up, one at a
time, as long as a channel has stuff that processes are waiting on.
# C++ Multithreading
`std::async` is designed to support the case where threads spawn more
threads if there is more work to do, and the pool of threads is not too large,
and threads terminate when they are out of work, or do the work
sequentially if doing it in parallel seems unlikely to yield benefits. C++ by
default manages the decision for you.
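
A small illustration of letting the library decide: `sum_range` and `async_sum` are made up names, and no launch policy is passed, so the implementation is free to run the task on another thread or to defer it until `get()` is called.

```C++
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

long long sum_range(const std::vector<int>& v, std::size_t begin, std::size_t end)
{
    assert(begin <= end && end <= v.size());
    return std::accumulate(v.begin() + static_cast<std::ptrdiff_t>(begin),
                           v.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
}

long long async_sum(const std::vector<int>& v)
{
    const std::size_t mid = v.size() / 2;
    // Default policy: may run in parallel, or sequentially inside get().
    std::future<long long> upper =
        std::async(sum_range, std::cref(v), mid, v.size());
    const long long lower = sum_range(v, 0, mid);   // meanwhile, do our half
    return lower + upper.get();                     // wait for, or run, the other half
}
```

`async_sum` on a vector holding 1 to 100 returns 5050 whichever way the library chooses to schedule the second half.
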
Maybe the solution is to use threads where we need stack state, and
continuations serviced by a single thread where we expect to handle one
and only one reply. Node.js gets by fine on one thread and one database
connection.
```C++
#include <thread>
#include <utility>
static_assert(__STDCPP_THREADS__==1, "Needs threads");
// As thread resources have to be managed, need to be wrapped in
// RAII
class ThreadRAII {
    std::thread m_thread;   // owned by value, so it cannot dangle
public:
    // As a thread object is moveable but not copyable, the thread obj
    // needs to be constructed inside the invocation of the ThreadRAII
    // constructor, and is then moved into the wrapper.
    explicit ThreadRAII(std::thread && threadObj) : m_thread(std::move(threadObj)){}
    ~ThreadRAII(){
        // Check if thread is joinable then detach the thread
        if(m_thread.joinable()){
            m_thread.detach();
        }
    }
};
```
Examples of thread construction
```C++
void foo(const char *){
}
class foo_functor
{
public:
    void operator()(const char *){
    }
};
int main(){
    // Braces avoid any chance of these being parsed as function declarations.
    ThreadRAII thread_one{std::thread(foo, "one")};
    ThreadRAII thread_two{
        std::thread(
            foo_functor(),
            "two"
        )
    };
    const char three[]{"three"};
    ThreadRAII thread_lambda{
        std::thread(
            [three](){
            }
        )
    };
}
```
C++ has a bunch of threading facilities that are designed for the case that
a normal procedural program forks a bunch of tasks to do stuff in parallel,
and then when they are all done, merges the results with join or promise
and future, and then the main program does its thing.
This is not so useful when the main program is event oriented, rather
than procedural.
If the main program is event oriented, then each thread has to stick around
for the duration, and has to have its own event queue, which C++ does not
directly provide.
In this case threads communicate by posting events, and primitives that do
thread synchronization (promise, future, join) are not terribly useful.
A thread grabs its event queue, using the mutex, pops out the next event,
releases the mutex, and does its thing.
If the event queue is empty, then, without releasing it, the thread
processing events waits on a [condition variable](https://thispointer.com//c11-multithreading-part-7-condition-variables-explained/) (which wait releases the
mutex). When another thread grabs the event queue mutex and stuffs
something into the event queue, it fires the [condition variable](https://thispointer.com//c11-multithreading-part-7-condition-variables-explained/), which
wakes up the thread that will process the event queue, and restores the
mutex to it.
Taking a mutex means constructing an RAII lock object, one of which we will
use in waiting on the condition object.
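
Putting the last few paragraphs together, a thread with its own event queue could be sketched as follows. `event_thread` and `post` are hypothetical names, events are simply `std::function<void()>`, and shutdown is handled by a flag that lets the thread drain the queue before exiting.

```C++
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical thread that sticks around for the duration, servicing its own queue.
class event_thread {
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::queue<std::function<void()>> m_events;
    bool m_stopping = false;
    std::thread m_worker;   // declared last, so everything else exists before it starts

    void run()
    {
        for (;;) {
            std::function<void()> event;
            {
                std::unique_lock<std::mutex> lock(m_mutex);   // RAII lock object
                // wait() releases the mutex while asleep and restores it on waking.
                m_ready.wait(lock, [this] { return m_stopping || !m_events.empty(); });
                if (m_events.empty()) return;   // stopping, and the queue is drained
                event = std::move(m_events.front());
                m_events.pop();
            }   // mutex released before the event does its thing
            event();
        }
    }
public:
    event_thread() : m_worker([this] { run(); }) {}
    ~event_thread()
    {
        { std::lock_guard<std::mutex> lock(m_mutex); m_stopping = true; }
        m_ready.notify_one();
        m_worker.join();
    }
    // Called from any other thread to stuff something into the queue.
    void post(std::function<void()> event)
    {
        assert(static_cast<bool>(event));
        { std::lock_guard<std::mutex> lock(m_mutex); m_events.push(std::move(event)); }
        m_ready.notify_one();   // fire the condition variable
    }
};

int demo_event_thread()
{
    int handled = 0;   // touched only by the event thread until it is joined
    {
        event_thread worker;
        for (int i = 1; i <= 10; ++i) worker.post([&handled, i] { handled += i; });
    }   // the destructor drains the queue and joins
    return handled;
}
```

`demo_event_thread()` returns 55. Unlike the `ThreadRAII` wrapper above, this one joins rather than detaches, since events that capture references must not outlive their owner.
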