> In CPU world there is a desire to shield programmers from those low-level details, but I think there are two interesting forces at play now-a-days that’ll change it soon. On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.
The problem is that not all programming languages expose SIMD, and even if they do it is only a portable subset, additionally the kind of skills that are required to be able to use SIMD properly isn't something everyone is confortable doing.
I certainly am not, still managed to get around with MMX and early SSE, can manage shading languages, and that is about it.
The good news is that the portable subset of SIMD is all you really need anyway. If you go beyond the portable subset, you need per-architecture code writing and testing, and you're mostly talking about pretty small gains relative to the cost.
> The answer, if it’s not obvious from my tone already:), is 8%.
Not if the data is small and in cache.
> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.
I think the best way to do this is duplicate sum_r and count 16 times, so each pane has a seperate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.
Yeah N is big enough that entire data isn't in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing around 94%+ L1d cache hit rate.
What I get in these article is that the original intent on C language stands true.
Use C as a common platform denominator without crazy optimizations (like tcc). If you need performance, specialize, C gives you the tools to call assembly (or use compiler some intrinsic or even inline assembly).
Complex compiler doing crazy optimizations, in my opinion, is not worth it.
Initial example takes array pointers without the __restrict__ keyword/extension so compiler might assume they could be aliased to same address space and will code defensively.
Would be interesting to see if auto vec performs better with that addition.
which honestly, shouldn't be neccessary today with avx512. There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput, while if it_is_ aligned you will get the same performance as the aligned-only load.
No reason for the compiler to balk at vectorizing unaligned data these days.
> There's essentially no reason to prefer the aligned load/store commands over the unaligned ones - if the actual pointer is unaligned it will function correctly at half the throughput
Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...
The problem is that not all programming languages expose SIMD, and even if they do it is only a portable subset, additionally the kind of skills that are required to be able to use SIMD properly isn't something everyone is confortable doing.
I certainly am not, still managed to get around with MMX and early SSE, can manage shading languages, and that is about it.
Not if the data is small and in cache.
> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.
I think the best way to do this is duplicate sum_r and count 16 times, so each pane has a seperate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction for each of the 16 buckets.
Use C as a common platform denominator without crazy optimizations (like tcc). If you need performance, specialize, C gives you the tools to call assembly (or use compiler some intrinsic or even inline assembly).
Complex compiler doing crazy optimizations, in my opinion, is not worth it.
See also https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...
Would be interesting to see if auto vec performs better with that addition.
auto aligned_p = std::assume_aligned<16>(p)
No reason for the compiler to balk at vectorizing unaligned data these days.
Getting a fault instead of half the performance is actually a really good reason to prefer aligned load/store. To be fair, you're talking about a compiler here, but I never understood why people use the unaligned intrinsics...