The code examples are confusing. The show the code that takes the locks, but they don’t show any of the data structures involved. The rwlock variant clones the Arc (makes sense), but the mutex variant does not (is it hidden inside inner.get)?
In any case, optimizing this well would require a lot more knowledge of what’s going on under the hood. What are the keys? Can the entire map be split into several maps? Can a reader hold the rwlock across multiple lookups? Is a data structure using something like RCU an option?
Does this apply also to std::shared_mutex in C++? This is a timely article if so; I’m in the middle of doing some C++ multithreading that relies on a shared_mutex. I have some measuring to do.
This is drawing broad conclusions from a specific RW mutex implementation. Other implementations adopt techniques to make the readers scale linearly in the read-mostly case by using per-core state (the drawback is that write locks need to scan it).
There are more sophisticated techniques such as RCU or hazard pointers that make synchronization overhead almost negligible for readers, but they generally require to design the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.
I think it’s not unusual that reader-writer locks, even if well implemented, get in places where there are so many readers stacked up that writers never get to get a turn or 1 writer winds up holding up N readers which is not so scalable as you increase N.
If implementation is task based and task always runs on same virtual CPU (slots equaling CPUs or parallelism), wonder if something like below might help.
RW lock could be implemented using an array of length equal to slots and proper padding to ensure each slot is in its own face line (avoid invalidating CPU cache when different slot is read/written).
For read lock: Each task acquires the lock for their slot.
For write lock: Acquire lock from left most slot to right. Writes can starve readers when they block on in-flight reader at a different slot when moving from left to right.
I'd be super interested in how this compares between cpu architectures, is there an optimization in Apple silicon that makes this bad while it'd fly on Intel/AMD cpus?
I've observed the same behavior on AMD and Intel at $WORK. Our solution (ideal for us, reads happening roughly 1B times more often than writes) was to pessimize writes in favour of reads and add some per-thread state to prevent cache line sharing.
We also tossed in an A/B system, so reads aren't delayed even while writes are happening; they just get stale data (also fine for our purposes).
the behaviour is quite typical for any MESI style cache coherence system (i.e. most if not all of them).
A specific microarchitecture might alleviate this a bit with lower latency cross-core communication, but the solution (using a single naive RW lock to protect the cache) is inherently non-scalable.
Take a look at crates like arc_swap if you have a read often write rarely lock case. You can easily implement the RCU pattern. Just be sure to read about how to use RCU properly.
Well done this pattern gives you nearly free reads and cheap writes, sometimes cheaper than a lock.
For frequent writes a good RWLock is often better since RCU can degrade rapidly and badly under write contention.
The code examples are confusing. The show the code that takes the locks, but they don’t show any of the data structures involved. The rwlock variant clones the Arc (makes sense), but the mutex variant does not (is it hidden inside inner.get)?
In any case, optimizing this well would require a lot more knowledge of what’s going on under the hood. What are the keys? Can the entire map be split into several maps? Can a reader hold the rwlock across multiple lookups? Is a data structure using something like RCU an option?
Does this apply also to std::shared_mutex in C++? This is a timely article if so; I’m in the middle of doing some C++ multithreading that relies on a shared_mutex. I have some measuring to do.
This is drawing broad conclusions from a specific RW mutex implementation. Other implementations adopt techniques to make the readers scale linearly in the read-mostly case by using per-core state (the drawback is that write locks need to scan it).
One example is folly::SharedMutex, which is very battle-tested: https://uvdn7.github.io/shared-mutex/
There are more sophisticated techniques such as RCU or hazard pointers that make synchronization overhead almost negligible for readers, but they generally require to design the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.
I think it’s not unusual that reader-writer locks, even if well implemented, get in places where there are so many readers stacked up that writers never get to get a turn or 1 writer winds up holding up N readers which is not so scalable as you increase N.
Right, and if you're on the JVM you have access to things like ConcurrentHashMap which is lock free.
And a Rust equivalent of folly::SharedMutex: https://docs.rs/crossbeam-utils/latest/crossbeam_utils/sync/...
If implementation is task based and task always runs on same virtual CPU (slots equaling CPUs or parallelism), wonder if something like below might help.
RW lock could be implemented using an array of length equal to slots and proper padding to ensure each slot is in its own face line (avoid invalidating CPU cache when different slot is read/written).
For read lock: Each task acquires the lock for their slot.
For write lock: Acquire lock from left most slot to right. Writes can starve readers when they block on in-flight reader at a different slot when moving from left to right.
I do not know how Rust RW locks are implemented.
I'd be super interested in how this compares between cpu architectures, is there an optimization in Apple silicon that makes this bad while it'd fly on Intel/AMD cpus?
I've observed the same behavior on AMD and Intel at $WORK. Our solution (ideal for us, reads happening roughly 1B times more often than writes) was to pessimize writes in favour of reads and add some per-thread state to prevent cache line sharing.
We also tossed in an A/B system, so reads aren't delayed even while writes are happening; they just get stale data (also fine for our purposes).
Rust has an interesting crate for this, arc-swap [1].
It's essentially just an atomic pointer that can be swapped out.
[1] https://docs.rs/arc-swap/latest/arc_swap/
Read lock requires communication between cores. It just can't scale with CPU count
the behaviour is quite typical for any MESI style cache coherence system (i.e. most if not all of them).
A specific microarchitecture might alleviate this a bit with lower latency cross-core communication, but the solution (using a single naive RW lock to protect the cache) is inherently non-scalable.
claudes love to talk about The Hardware Reality
"The performance Death Spiral" was the point I realised I was being LLMd and bailed out.
Take a look at crates like arc_swap if you have a read often write rarely lock case. You can easily implement the RCU pattern. Just be sure to read about how to use RCU properly.
Well done this pattern gives you nearly free reads and cheap writes, sometimes cheaper than a lock.
For frequent writes a good RWLock is often better since RCU can degrade rapidly and badly under write contention.