Beating NumPy matrix multiplication in 150 lines of C

(salykova.github.io)

212 points | by p1esk 4 days ago

12 comments

david-gpu 3 days ago
There is no reason to burden one side with Python whole the other side is C when they could have just as easily perform an apples-to-apples comparison where both sides are written in C, one calling a BLAS library while the other calls this other implementation.
ks2048 4 days ago
This looks like a nice write-up and implementation. I'm left wondering what is the "trick"? How does it manage to beat OpenBLAS, which is assembly+C optimized over decades for this exact problem? It goes into detail about caching, etc - is BLAS is not taking advantage of these things, or is this more tuned to this specific processor, etc?
[-]
- hansvm 3 days ago
  - OpenBLAS isn't _that_ optimized for any specific modern architecture.
  - The matrices weren't that big. Numpy has cffi overhead.
  - The perf difference was much more noticeable with _peak_ throughput rather than _mean_ throughput, which matters for almost no applications (a few, admittedly, but even where "peak" is close to the right measure you usually want something like the mean of the top-k results or the proportion with under some latency, ...).
  - The benchmarking code they displayed runs through Python's allocator for numpy and is suggestive of not going through any allocator for the C implementation. Everything might be fine, but that'd be the first place I checked for microbenchmarking errors or discrepancies (most numpy routines allow in-place operations; given that that's known to be a bottleneck in some applications of numpy, I'd be tempted to explicitly examine benchmarks for in-place versions of both).
  - Numpy has some bounds checking and error handling code which runs regardless of the underlying implementation. That's part of why it's so bleedingly slow for small matrices compared to even vanilla Python lists (they tested bigger matrices too, so this isn't the only effect, but I'll mention it anyway). It's hard to make something faster when you add a few thousand cycles of pure overhead.
  - This was a very principled approach to saturating the relevant caches. It's "obvious" in some sense, but clear engineering improvements are worth highlighting in discussions like this, in the sense that OpenBLAS, even with many man-hours, likely hasn't thought of everything.
  And so on. Anyone can rattle off differences. A proper explanation requires an actual deep-dive into both chunks of code.
  [-]
  - robxorb 3 days ago
    To your third point - it looks as if the lines of mean values were averaged, this posts code would still be a clear winner.
- ipsum2 3 days ago
  Comparison with numpy 2.0 should be better for numpy because it integrates Google highway for better simd across different microarchitectures.
- sbstp 4 days ago
  Maybe -march=native gives it an edge as it compiles for this exact CPU model whereas numpy is compiled for a more generic (older) x86-64. -march=native would probably get v4 on a Ryzen CPU where numpy is probably targeting v1 or v2.
  https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...
  [-]
  - stingraycharles 3 days ago
    Doesn’t numpy have runtime SIMD dispatching and whatnot based on CPU flags?
    E.g. https://github.com/numpy/numpy/blob/main/numpy/_core/src/com...
    [-]
    - KeplerBoy 3 days ago
      np.matmul just uses whatever blas library your NumPy distribution was configured for/shipped with.
      Could be MKL (i believe the conda version comes with it) but it could also be an ancient version of OpenBLAS you already had installed. So yeah, being faster than np.matmul probably just means your NumPy is not installed optimally.
ssivark 4 days ago
Most common coding patterns leave a lot of performance unclaimed, by not fully specializing to the hardware. This article is an interesting example. For another interesting demonstration, see this CS classic "There's plenty of room at the top"
https://www.science.org/doi/10.1126/science.aam9744
[-]
- hmaarrfk 3 days ago
  Thanks for sharing. That was a great read
teo_zero 3 days ago
Does it make sense to compare a C executable with an interpreted Python program that calls a compiled library? Is the difference due to the algorithm or the call stack?
[-]
- rustybolt 3 days ago
  Yes
dzaima 4 days ago
Though not at all part of the hot path, the inefficiency of the mask generation ('bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing constant vector {0,1,2,3,4,...} with broadcasted m and m-8.
Very moot nitpick though, given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (esp. store, which is still slow on Zen 4[1] despite the AVX-512 instruction (whose only difference is taking the mask in a mask register) being 6x faster), and clang autovectorizes the shifting anyways (maybe like 2-3x slower than my suggestions).
[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
[-]
- salykova 3 days ago
  Hi! I'm the author of the article. It's my really first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but Im willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.
  Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.
  "Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!
leeoniya 3 days ago
what about jart's tinyBLAS?
https://hacks.mozilla.org/2024/04/llamafiles-progress-four-m...
also https://justine.lol/matmul/
[-]
- salykova 3 days ago
  We were actively chatting with Justine yesterday, seems like the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is in Mozilla AI discord: https://discord.com/invite/NSnjHmT5xY
  [-]
  - salykova 3 days ago
    "off-topic" channel
azornathogron 4 days ago
I've only skimmed so far, but this post has a lot of detail and explanation. Looks like a pretty great description of how fast matrix multiplications are implemented to take into account architectural concerns. Goes on my reading list!
[-]
- ghghgfdfgh 4 days ago
  I don’t save articles often, but maybe one time in a few months I see something I know I will enjoy reading again even after 1 or 2 years. Keep up the great work OP!
marmaduke 3 days ago
Very nice write up. Those are the kind of matrix sizes that MKL is fairly good at, might be worth a comparison as well?
Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.
SushiHippie 3 days ago
In the README, it says:
> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.
I think I'll need a TL;DR on what to change all these values to.
I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.
So I took a look at the other values present in the benchmark.c, but MC and NC are already calculated via the amount of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits for the 7700(X) (the author's CPU) and how I'd need to adjust it for the 7950X (with the informations from the article).
jstrong 4 days ago
in terms of comparing to numpy, how much overhead would there be from Python (vs. running the numpy C code alone)?
[-]
- SJC_Hacker 4 days ago
  Python can efficiently call C libs if you use ctypes and native pointers, which numpy uses. Of course depends on expected layout.
  If you want to convert to Python lists its is going to take time. Not sure about Python arrays.
  [-]
  - dgan 3 days ago
    I don't recall the link but there was a github repo with comparisons of CFFI implementations in different languages, and from what i remember Python was 'bout 3 orders of magnitude slower than, say, Lua or Ocaml
    Edit: ha, found it https://news.ycombinator.com/from?site=github.com/dyu
  - brnt 3 days ago
    If you use numpy, then you use an ndarray, which you can create from a C array for 'free' (no copying, just set a pointer).
KerrAvon 4 days ago
The article claims this is portable C. Given the use of intel intrinsics, what happens if you try to compile it for ARM64?
[-]
- tempay 4 days ago
  I think they mean portable between modern x86 CPUs as opposed to truly portable.
  [-]
- rurban 3 days ago
  You'd need first to convert it to portable SIMD intrinsics. There are several libraries.
le-mark 4 days ago
> This is my first time writing a blog post. If you enjoy it, please subscribe and share it!
Great job! Self publishing things like this were a hallmark of the early internet I for one sorely miss.