Looks like a good write-up, but I'd caution that some of the statements about memory models aren't completely accurate.
The terms relaxed, acquire, and release refer to how an atomic operation is ordered against other accesses to memory.
Counter to what the article states, a relaxed atomic is still atomic, meaning that it cannot tear and, for RMW atomics, no other access to that location can go between the read and the write. But a relaxed atomic does not order other accesses, which can lead to unintuitive outcomes.
By contrast, once you've observed another thread's release store with an acquire load, you're guaranteed that your subsequent memory accesses "happen after" all of the other thread's accesses from before that release store -- which is what you'd intuitively expect. It's just that on modern systems (which are really highly distributed systems even on a single chip) there's a cost to establishing this kind of guarantee, which is why you can opt out of it with relaxed atomics if you know what you're doing.
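To make that concrete, here's a minimal sketch (mine, not from the article) of the classic message-passing pattern this guarantee describes. With memory_order_relaxed on the flag operations instead, the guarantee disappears and the assert could fire:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // publish
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // wait for publication
    assert(payload == 42);  // guaranteed: the acquire "saw" the release
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}
```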
Rather than incrementing each counter by one, dither the counters to reduce cache conflicts? So what if the dequeue becomes a bit fuzzy. Make the queue a bit longer, so everyone survives at least as long as they would have survived before.
Or simply use a prime-length queue and change what +1 means, so one hop's stride is longer than the cache-conflict distance. For a prime length, any nonzero stride generates the full cyclic group.
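A minimal sketch of the prime-stride idea, with illustrative constants of my own choosing (not from the article):

```cpp
#include <cstdio>

// With a prime ring length, any nonzero stride is coprime to the length,
// so successive claims still visit every slot exactly once per cycle --
// they just land far enough apart to sit on different cache lines.
constexpr unsigned kRingLen = 127;  // prime
constexpr unsigned kStride  = 9;    // > one cache line's worth of 8-byte slots

int main() {
    bool seen[kRingLen] = {};
    unsigned idx = 0;
    for (unsigned i = 0; i < kRingLen; i++) {
        seen[idx] = true;
        idx = (idx + kStride) % kRingLen;
    }
    for (unsigned i = 0; i < kRingLen; i++)
        if (!seen[i]) { std::puts("missed a slot"); return 1; }
    std::puts("full cycle: every slot visited once");
}
```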
My understanding is that there is no global ordering anyway: allocation of a node has a total order, but actually writing data to a node can be arbitrarily delayed. At that point you might as well use separate queues for each writer.
edit: the author claims the algorithm is linearizable, so I'll keep reading the article.
Yes, I meant to clarify the memory model discussion; I had tried to simplify it and did a poor job. I got similar feedback after it was published and never remembered to get to it. I'll try to do it soon, though it's about the worst time for this to have hit; not sure when I'll be able to sit down for it, but I'll try to get it done in the next day. Hopefully it doesn't wait until the next time it gets some views.
I also note that you don't mention the cache-line contention issue when accessing atomics in a multi-threaded context. That's a huge performance issue with lock-free constructs.
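For readers who haven't hit this: the usual mitigation for the false-sharing half of the problem is padding each independently updated atomic out to its own cache line. A minimal sketch (mine; sizes are illustrative):

```cpp
#include <atomic>
#include <cstdint>

// Pad each independently updated atomic to its own cache line so threads
// bumping different counters don't invalidate each other's lines. Note this
// does nothing for *true* sharing, where every thread RMWs the same variable.
struct alignas(64) PaddedCounter {     // 64 bytes: a typical x86 line size
    std::atomic<uint64_t> value{0};
};

PaddedCounter per_thread_counters[16]; // hypothetical per-thread slots

void bump(int tid) {
    per_thread_counters[tid].value.fetch_add(1, std::memory_order_relaxed);
}
```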
A good write-up, but a preallocated MPMC queue can be built without using CAS on every push and without fixed-size slots.
Even SPSC ring buffers, while simple, are not particularly efficient in the case where the consumer is keeping up with the producer (the ideal case for a queue), due to all the cache-line ping-pong. They are not the lowest-latency solution.
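For what it's worth, the usual mitigation is index caching. Here's a rough sketch (mine, not the article's code) of an SPSC ring where each side keeps a stale copy of the other side's index and only touches the shared cache line when the ring looks full or empty:

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

// Each side reloads the other side's index only when the ring *looks*
// full/empty, cutting the head/tail cache-line ping-pong to roughly once
// per wraparound on the fast path.
template <typename T, size_t N>  // N must be a power of two
class SpscRing {
    T buf_[N];
    alignas(64) std::atomic<size_t> head_{0};  // advanced by consumer
    alignas(64) std::atomic<size_t> tail_{0};  // advanced by producer
    alignas(64) size_t cached_head_ = 0;       // producer's stale view
    alignas(64) size_t cached_tail_ = 0;       // consumer's stale view
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - cached_head_ == N) {                    // looks full...
            cached_head_ = head_.load(std::memory_order_acquire);
            if (t - cached_head_ == N) return false;    // ...really full
        }
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == cached_tail_) {                        // looks empty...
            cached_tail_ = tail_.load(std::memory_order_acquire);
            if (h == cached_tail_) return std::nullopt; // ...really empty
        }
        T v = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);
        return v;
    }
};
```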
Why would you want an MPMC queue primitive, instead of just using MPSC per consumer? I don't really see a discussion of the cache contention issues. (There is some mention of contention, but it seems the author is using the word in a different way.)
It looks like both enqueue and dequeue are RMWing the "epochs" variable? This can pretty easily lead to something like O(N^2) runtime if N processors are banging this cache line at the same time.
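If you want to see the effect, here's a rough, machine-dependent benchmark sketch (mine, not from the article) comparing N threads hammering one shared atomic versus per-thread padded atomics:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// The shared case typically degrades sharply as N grows -- every fetch_add
// has to win the same cache line -- while the striped case scales flat.
constexpr int kIters = 1'000'000;

struct alignas(64) Slot { std::atomic<uint64_t> v{0}; };

int main() {
    for (int n : {1, 2, 4, 8}) {
        std::atomic<uint64_t> shared{0};
        std::vector<Slot> striped(n);
        auto run = [&](bool use_shared) {
            std::vector<std::thread> ts;
            auto t0 = std::chrono::steady_clock::now();
            for (int i = 0; i < n; i++)
                ts.emplace_back([&, i] {
                    for (int k = 0; k < kIters; k++)
                        (use_shared ? shared : striped[i].v)
                            .fetch_add(1, std::memory_order_relaxed);
                });
            for (auto& t : ts) t.join();
            return std::chrono::duration<double, std::milli>(
                       std::chrono::steady_clock::now() - t0).count();
        };
        std::printf("threads=%d  shared=%7.1fms  striped=%7.1fms\n",
                    n, run(true), run(false));
    }
}
```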
I wrote my first SPSC circular buffer implementation in TS upon reading the previous post on futex [0] from the same author. It was more intricate than it had seemed, until I finally sat down and wrote my own code.
I recently started playing with these in game design as well to coordinate networking and game threads in a lock free manner. If it fits your use case it is truly a free lunch! They definitely have a lot of edge cases and are not easy to implement, but they aren't too crazy either.
By a quick glance, yes, this is what I want: a channel to communicate between processes via a piece of shared memory, protected by a pair of futexes.
In the JS ecosystem, buffers that allow data loss (aka ring buffers) are more common, but ringbuf.js [1] is the only complete implementation to my knowledge. In my use case, I/O between WASM modules where data must be transferred as-is, the process must block on buffer overrun/underrun in a synchronous manner, so a circular buffer is required. I could not find such a niche library written in JS, so I decided to bite the bullet and reinvent the wheel [2].
I don't think I even saw this until I published the article. I don't think it was academically published, or googled well, back in 2000. Nor did it match my needs for a ring buffer at the time, which were to drop stale data (I think in that algorithm, write operations fail when the buffer is full), so if I did see it, I wouldn't have paid enough attention to notice whether it even had users. It's good work for sure, but that's why it didn't get mentioned.
Eh, I get it. I also independently came up with these MPMC approaches before hearing about LMAX. They're not entirely trivial but they do follow fairly naturally from the problem space when you really think about it. It's a good piece of engineering, but the one thing that's really unique about LMAX is the amount of advertising it gets.
> the one thing that's really unique about LMAX is the amount of advertising it gets.
Virtually zero? I have to go out of my way to remind HN it exists while everyone is 300+ comments deep into reinventing the wheel on a ~quarterly basis.
Seems hung up on defining a ring buffer as something non-blocking (dropping). Having used ring buffers in both software and hardware systems for 40 years, we always called them ring buffers without this distinction. We would have called them all ring buffers. One nice thing about them is that it’s very easy to pass buffers back and forth between a single producer and single consumer without any locks or fancy memory atomics. All you need is a guarantee that memory writes occur in order. The AMD LANCE Ethernet controller (and many later derivatives) used this scheme to allow smooth overlapping of software frame processing with hardware transmission and reception way back in the 1980s.
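For anyone who hasn't seen the pattern, here's a rough sketch of an ownership-flag descriptor ring in that spirit (names and sizes are mine; the real LANCE descriptor layout differs):

```cpp
#include <atomic>
#include <cstdint>

// The producer fills a slot, then flips OWN with a release store; the
// consumer polls OWN with an acquire load. The data writes are therefore
// visible before the flag flip is -- i.e., all you need is the guarantee
// that memory writes become visible in order.
struct Descriptor {
    uint8_t data[1536];   // e.g., one Ethernet frame; bounds checks omitted
    uint32_t len = 0;
    std::atomic<bool> own{false};  // false: producer's, true: consumer's
};

Descriptor ring[64];

bool produce(unsigned i, const uint8_t* frame, uint32_t n) {
    Descriptor& d = ring[i % 64];
    if (d.own.load(std::memory_order_acquire)) return false;  // still in use
    for (uint32_t k = 0; k < n; k++) d.data[k] = frame[k];
    d.len = n;
    d.own.store(true, std::memory_order_release);             // hand off
    return true;
}

bool consume(unsigned i, uint32_t* out_len) {
    Descriptor& d = ring[i % 64];
    if (!d.own.load(std::memory_order_acquire)) return false; // nothing yet
    *out_len = d.len;  // d.data[0..len) is fully visible here
    d.own.store(false, std::memory_order_release);            // hand back
    return true;
}
```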
This isn't that new. (see: FASTER: A Concurrent Key-Value Store with In-Place Updates. 2018 ACM SIGMOD International Conference on Management of Data and related papers)
However, this is well written and very easy to read.
Well, when I was doing the original work on it (about 5 years ago now), I spent a lot of time trying to find something else in the literature. I couldn't find anything that wasn't SPMC or MPSC, unless it had severe limitations, like not actually having a drop policy when full.
However, I definitely had not seen the paper you cited before now; I just spent a few minutes with it. Section 5.2 seems to cover their version. It's far from clear what their algorithm is, but they are talking about page eviction; it doesn't seem like they're using even a single fixed array for the ring, though I'm not 100% sure because the paper is pretty light on detail.
I hope you won't mind my picking your brain: which MPSC ring buffer implementations did you find that drop old items when full? I could only find implementations that are basically repurposed MPMC, or that cannot deal with non-POD data (seqlock-based).
So that work came after mine, and seems to be a FIFO not a ring buffer. The library I built at the time also had FIFOs and LIFOs that were tweaks on previous algorithms, but nothing earth shaking. I'll check this one out when I can though.
You can use a 64-bit CAS if you want, with a 32-bit epoch and a pointer compression scheme of any kind, or just a 32-bit index into regions that are thread-specific. I think I did the latter when I did the original work, using the core primitive to build ring buffers that have arbitrary-sized slots instead of 64-bit slots. That requires a bit of additional gymnastics, but the basic trick is to have the ring index into a bigger ring that you can FAA into, where the bigger ring has more slots by at least the max number of threads (I use this primitive heavily still for in-memory debug logging). Maybe at some point I'll do an article on that too.
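A minimal sketch of that epoch-plus-index packing (the layout is my assumption, not necessarily the parent's exact scheme): both fields live in one 64-bit word, so a plain 64-bit CAS updates them together and the epoch half defuses ABA on the index half.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> state{0};

constexpr uint64_t pack(uint32_t epoch, uint32_t index) {
    return (uint64_t(epoch) << 32) | index;
}
constexpr uint32_t epoch_of(uint64_t s) { return uint32_t(s >> 32); }
constexpr uint32_t index_of(uint64_t s) { return uint32_t(s); }

bool try_advance(uint32_t new_index) {
    uint64_t old = state.load(std::memory_order_acquire);
    // Bump the epoch on every successful update so a stale (epoch, index)
    // pair can never compare equal again (the ABA problem).
    uint64_t neu = pack(epoch_of(old) + 1, new_index);
    return state.compare_exchange_strong(old, neu,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed);
}
```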
BTW, it should be noted that the need to issue a cache-line lock on x86 does seem to slow down 128-bit CAS quite a bit on x86-64 platforms. On arm64, there's no reason to skimp with a 64-bit CAS.
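And for comparison, the 16-byte case. Whether this compiles to a lock-free cmpxchg16b (x86-64) or casp (arm64/LSE) depends on the compiler and flags such as -mcx16; on gcc you may also need -latomic, so check is_lock_free() before relying on it for anything latency-sensitive.

```cpp
#include <atomic>
#include <cstdint>

// A 16-byte trivially copyable struct; std::atomic compares it bitwise.
struct Pair { uint64_t epoch; uint64_t ptr_bits; };

std::atomic<Pair> wide{{0, 0}};

bool bump_epoch(uint64_t new_ptr_bits) {
    Pair old = wide.load(std::memory_order_acquire);
    Pair neu{old.epoch + 1, new_ptr_bits};  // epoch defuses ABA, as above
    return wide.compare_exchange_strong(old, neu);
}
```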
https://en.wikipedia.org/wiki/False_sharing
<3
[0]: https://h4x0r.org/futex/ discussion: https://news.ycombinator.com/item?id=44951563
[1]: https://github.com/padenot/ringbuf.js
[2]: https://github.com/andy0130tw/spsc
Which is based on: https://ieeexplore.ieee.org/document/9490347