There is no reason to burden one side with Python whole the other side is C when they could have just as easily perform an apples-to-apples comparison where both sides are written in C, one calling a BLAS library while the other calls this other implementation.
This looks like a nice write-up and implementation. I'm left wondering what is the "trick"? How does it manage to beat OpenBLAS, which is assembly+C optimized over decades for this exact problem? It goes into detail about caching, etc - is BLAS is not taking advantage of these things, or is this more tuned to this specific processor, etc?
- OpenBLAS isn't _that_ optimized for any specific modern architecture.
- The matrices weren't that big. Numpy has cffi overhead.
- The perf difference was much more noticeable with _peak_ throughput rather than _mean_ throughput, which matters for almost no applications (a few, admittedly, but even where "peak" is close to the right measure you usually want something like the mean of the top-k results or the proportion with under some latency, ...).
- The benchmarking code they displayed runs through Python's allocator for numpy and is suggestive of not going through any allocator for the C implementation. Everything might be fine, but that'd be the first place I checked for microbenchmarking errors or discrepancies (most numpy routines allow in-place operations; given that that's known to be a bottleneck in some applications of numpy, I'd be tempted to explicitly examine benchmarks for in-place versions of both).
- Numpy has some bounds checking and error handling code which runs regardless of the underlying implementation. That's part of why it's so bleedingly slow for small matrices compared to even vanilla Python lists (they tested bigger matrices too, so this isn't the only effect, but I'll mention it anyway). It's hard to make something faster when you add a few thousand cycles of pure overhead.
- This was a very principled approach to saturating the relevant caches. It's "obvious" in some sense, but clear engineering improvements are worth highlighting in discussions like this, in the sense that OpenBLAS, even with many man-hours, likely hasn't thought of everything.
And so on. Anyone can rattle off differences. A proper explanation requires an actual deep-dive into both chunks of code.
Maybe -march=native gives it an edge as it compiles for this exact CPU model whereas numpy is compiled for a more generic (older) x86-64. -march=native would probably get v4 on a Ryzen CPU where numpy is probably targeting v1 or v2.
np.matmul just uses whatever blas library your NumPy distribution was configured for/shipped with.
Could be MKL (i believe the conda version comes with it) but it could also be an ancient version of OpenBLAS you already had installed. So yeah, being faster than np.matmul probably just means your NumPy is not installed optimally.
Most common coding patterns leave a lot of performance unclaimed, by not fully specializing to the hardware. This article is an interesting example. For another interesting demonstration, see this CS classic "There's plenty of room at the top"
Does it make sense to compare a C executable with an interpreted Python program that calls a compiled library? Is the difference due to the algorithm or the call stack?
Though not at all part of the hot path, the inefficiency of the mask generation ('bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing constant vector {0,1,2,3,4,...} with broadcasted m and m-8.
Very moot nitpick though, given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (esp. store, which is still slow on Zen 4[1] despite the AVX-512 instruction (whose only difference is taking the mask in a mask register) being 6x faster), and clang autovectorizes the shifting anyways (maybe like 2-3x slower than my suggestions).
Hi! I'm the author of the article. It's my really first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but Im willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.
Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.
"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!
We were actively chatting with Justine yesterday, seems like the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is in Mozilla AI discord: https://discord.com/invite/NSnjHmT5xY
I've only skimmed so far, but this post has a lot of detail and explanation. Looks like a pretty great description of how fast matrix multiplications are implemented to take into account architectural concerns. Goes on my reading list!
I don’t save articles often, but maybe one time in a few months I see something I know I will enjoy reading again even after 1 or 2 years. Keep up the great work OP!
Very nice write up. Those are the kind of matrix sizes that MKL is fairly good at, might be worth a comparison as well?
Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.
> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.
I think I'll need a TL;DR on what to change all these values to.
I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.
So I took a look at the other values present in the benchmark.c, but MC and NC are already calculated via the amount of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits for the 7700(X) (the author's CPU) and how I'd need to adjust it for the 7950X (with the informations from the article).
I don't recall the link but there was a github repo with comparisons of CFFI implementations in different languages, and from what i remember Python was 'bout 3 orders of magnitude slower than, say, Lua or Ocaml
- The matrices weren't that big. Numpy has cffi overhead.
- The perf difference was much more noticeable with _peak_ throughput rather than _mean_ throughput, which matters for almost no applications (a few, admittedly, but even where "peak" is close to the right measure you usually want something like the mean of the top-k results or the proportion with under some latency, ...).
- The benchmarking code they displayed runs through Python's allocator for numpy and is suggestive of not going through any allocator for the C implementation. Everything might be fine, but that'd be the first place I checked for microbenchmarking errors or discrepancies (most numpy routines allow in-place operations; given that that's known to be a bottleneck in some applications of numpy, I'd be tempted to explicitly examine benchmarks for in-place versions of both).
- Numpy has some bounds checking and error handling code which runs regardless of the underlying implementation. That's part of why it's so bleedingly slow for small matrices compared to even vanilla Python lists (they tested bigger matrices too, so this isn't the only effect, but I'll mention it anyway). It's hard to make something faster when you add a few thousand cycles of pure overhead.
- This was a very principled approach to saturating the relevant caches. It's "obvious" in some sense, but clear engineering improvements are worth highlighting in discussions like this, in the sense that OpenBLAS, even with many man-hours, likely hasn't thought of everything.
And so on. Anyone can rattle off differences. A proper explanation requires an actual deep-dive into both chunks of code.
https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...
E.g. https://github.com/numpy/numpy/blob/main/numpy/_core/src/com...
Could be MKL (i believe the conda version comes with it) but it could also be an ancient version of OpenBLAS you already had installed. So yeah, being faster than np.matmul probably just means your NumPy is not installed optimally.
https://www.science.org/doi/10.1126/science.aam9744
Very moot nitpick though, given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (esp. store, which is still slow on Zen 4[1] despite the AVX-512 instruction (whose only difference is taking the mask in a mask register) being 6x faster), and clang autovectorizes the shifting anyways (maybe like 2-3x slower than my suggestions).
[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.
"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!
https://hacks.mozilla.org/2024/04/llamafiles-progress-four-m...
also https://justine.lol/matmul/
Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.
> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.
I think I'll need a TL;DR on what to change all these values to.
I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.
So I took a look at the other values present in the benchmark.c, but MC and NC are already calculated via the amount of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits for the 7700(X) (the author's CPU) and how I'd need to adjust it for the 7950X (with the informations from the article).
If you want to convert to Python lists its is going to take time. Not sure about Python arrays.
Edit: ha, found it https://news.ycombinator.com/from?site=github.com/dyu
Great job! Self publishing things like this were a hallmark of the early internet I for one sorely miss.