How much do amd64 microarchitecture levels help in Go?

(lemire.me)

63 points | by zdw 1 day ago

13 comments

seddonm1 8 hours ago
I would be interested to know if there is a method similar to this one in Rust [0] that allows a single binary to support multiple optimization levels depending on the executing CPU? It feels wasteful not to enable these optimizations, but I don't really want to force users to choose from a complex feature matrix.
[0] https://github.com/ronnychevalier/cargo-multivers
- SkiFire13 6 hours ago
  Note that you can do this yourself manually for single pieces of code by using `is_x86_feature_detected` [0]
  [0] https://doc.rust-lang.org/stable/std/macro.is_x86_feature_de...
- mikepurvis 7 hours ago
  I've long been surprised there isn't more multiversion stuff built right into every language compiler; I would have thought Intel would be very motivated to get more binaries lighting up the features they add to their expensive top line CPUs.
  But yeah, no: on the whole, the cost of the checks and the duplicated binary size aren't seen as worth it, so instead it's piecemeal implementations, mostly in numeric packages like eigen and lapack.
  - Someone 7 hours ago
    > so instead it's piecemeal implementations mostly in numeric packages like eigen and lapack.
    Because that’s where the user-noticeable gains can be made. Using popcount in code you run once is going to shave off, maybe, 100 cycles. That isn’t worth the extra cycles of that approach.
    Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.
    Finally, where it really matters, it’s not only a matter of recompiling the same code. For optimal performance, you also want to change loop unrolling strategy, stride count, etc.
    - seddonm1 6 hours ago
      Based on the now-deprecated Clear Linux it does seem that these optimizations add up [0] and so maybe we should be considering them more broadly?
      [0] https://www.phoronix.com/review/clear-linux-48p-ubuntu/6
      - Someone 2 hours ago
        AFAIK, that wasn’t solely a matter of picking the optimal compilation flags. It also included profile-guided optimizations and kernel tweaks.
        XorNot 2 hours ago
        Doesn't that get into the domain of a distro like Gentoo though? (Or sort of Nix as well) - rebuild everything to precisely target your actual architecture.
        monster_truck 56 minutes ago
        Yes, and the maximally performant configurations are at odds with a usefully hardened security posture. This only gets more true as time goes on and additional mitigations pile up.
    - fweimer 4 hours ago
      POPCNT is an interesting example. Runtime dispatch (with a conditional branch) would actually make sense for it because it's comparatively difficult to implement from scratch. PDEP and PEXT might be similar (but I don't think compilers pattern-match for it, unlike POPCNT). AArch64 uses localized run-time dispatch extensively because LL/SC atomics are so very bad on current cores, but there isn't anything comparable in the x86-64 space (POPCNT isn't that frequent).
      For many other things, like using a YMM register to copy a 32-byte struct or a variable shift, run-time dispatch just does not make sense. You will only see a benefit if you generate this code unconditionally. For FMA, you wouldn't even get bit-identical output, leading to testing concerns.
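To illustrate the FMA point above, here is a minimal Go sketch (not from the thread; the names `unfused`, `fused`, and `demoInputs` are made up for illustration). It compares a multiply-then-add with two roundings against `math.FMA`, which rounds once. The inputs are chosen so that the exact product, 1 - 2^-54, cannot be represented in float64.

```go
package main

import (
	"fmt"
	"math"
)

// unfused rounds the product to float64 before adding. The explicit
// float64 conversion forbids the compiler from fusing the two steps.
func unfused(x, y, z float64) float64 {
	return float64(x*y) + z
}

// fused computes x*y + z with a single rounding at the end.
func fused(x, y, z float64) float64 {
	return math.FMA(x, y, z)
}

// demoInputs returns values whose exact product, 1 - 2^-54, lies
// halfway between two float64 values and rounds to exactly 1.
func demoInputs() (x, y, z float64) {
	eps := math.Ldexp(1, -27) // 2^-27, computed at run time
	return 1 + eps, 1 - eps, -1
}

func main() {
	x, y, z := demoInputs()
	fmt.Println("unfused:", unfused(x, y, z))
	fmt.Println("fused:  ", fused(x, y, z))
}
```

With these inputs the unfused version returns 0 and the fused version returns -2^-54, so a code path that fuses and one that does not can disagree on the same data.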
    - throawayonthe 5 hours ago
      > Also, FTA: “and arguably the whole scheme should be replaced by finer-grained feature detection”. Such feature detection would lead to a combinatorial explosion of different binaries.
      the thread is about runtime detection tbf
      - Someone 3 hours ago
        I see I wasn’t clear enough. The tool I discussed generates multiple binaries and then packs all of them into a single binary. I was referring to the former.
        https://github.com/ronnychevalier/cargo-multivers:
        “After building the different versions, it computes a hash of each version and it filters out the duplicates (i.e., the compilations that gave the same binaries despite having different CPU features). Finally, it builds a runner that embeds one version compressed (the source) and the others as compressed binary patches to the source. For instance, when building for the target x86_64-pc-windows-msvc, by default 4 different versions will be built, filtered, compressed, and merged into a single portable binary.
        When executed, the runner uncompresses and executes the version that matches the CPU features of the host.”
        Hopefully (and likely) the patches will not be too large, but for 6 binary compiler flags, you’d still have 2⁶ binaries.
        wongarsu 1 hour ago
        Yeah, but that's because of pragmatic choices to limit the scope of the tool. In the wider context of "I've long been surprised there isn't more multiversion stuff built right into every language compiler" it's easy to imagine a compiler that can heuristically detect which functions would benefit from certain CPU features, and walk over the call graph to find locations for runtime feature detection that balance detection overhead with code duplication for the fallback functions. For example merging the feature detection of adjacent function calls, making sure feature detection is moved out of hot loops, etc.
        Obviously this is much easier to imagine than to implement. And in some languages it might be made impossible by certain language features (function pointers might become tricky). But this is more or less what some people do by hand in Rust with the more manual is_x86_feature_detected macro, so there's no obvious reason why compilers couldn't automate it in at least some languages.
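As a rough idea of what doing this by hand looks like, here is a Go sketch: the feature check happens once, outside the hot loop, and selects a whole-loop implementation. `cpuFeatures` and its `hasPOPCNT` field are hypothetical placeholders for a real probe (in practice a package such as `golang.org/x/sys/cpu` could supply the flag), and none of this is what a compiler actually generates.

```go
package main

import (
	"fmt"
	"math/bits"
)

// cpuFeatures is a hypothetical stand-in for a real CPU probe.
type cpuFeatures struct {
	hasPOPCNT bool
}

// bitCounter counts the set bits across a slice of words.
type bitCounter func(words []uint64) int

// countGeneric is the fallback: it clears one set bit per iteration.
func countGeneric(words []uint64) int {
	total := 0
	for _, w := range words {
		for ; w != 0; w &= w - 1 {
			total++
		}
	}
	return total
}

// countFast relies on bits.OnesCount64, which the compiler may lower
// to a hardware population-count instruction.
func countFast(words []uint64) int {
	total := 0
	for _, w := range words {
		total += bits.OnesCount64(w)
	}
	return total
}

// selectCounter makes the feature decision once, so the loops above
// contain no per-element feature checks.
func selectCounter(f cpuFeatures) bitCounter {
	if f.hasPOPCNT {
		return countFast
	}
	return countGeneric
}

func main() {
	count := selectCounter(cpuFeatures{hasPOPCNT: true})
	fmt.Println(count([]uint64{0xff, 0x0f, 1}))
}
```

The cost is one indirect call per slice rather than one branch per element; the trade-off wongarsu describes is where to place that decision so it is neither too fine nor too coarse.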
      - rob74 4 hours ago
        Ok, then it will be an explosion of binary size, if you have several code blocks optimized for each architecture level - I'm not very familiar with the subject, but I imagine it would have to be relatively large chunks of code, otherwise the constant branching would eat up the speed advantage.
        masklinn 3 hours ago
        These are usually pretty tight loops or constructs based on specific features.
        An unspecialised popcnt is half a dozen instructions; for specialised versions it’s 4 implementations ranging from half a dozen to two dozen bytes.
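For a sense of scale, here is one common portable popcount next to the standard-library call, as a sketch. `popcountPortable` is an illustrative name, not the fallback any particular compiler or library uses, and whether `bits.OnesCount64` becomes a single POPCNT instruction depends on the compiler version and the GOAMD64 level.

```go
package main

import (
	"fmt"
	"math/bits"
)

// popcountPortable counts set bits with shifts, masks, adds and one
// multiply, so it needs no special CPU instruction.
func popcountPortable(x uint64) int {
	// Sum adjacent bits into 2-bit fields.
	x -= (x >> 1) & 0x5555555555555555
	// Sum adjacent 2-bit fields into 4-bit fields.
	x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
	// Sum adjacent 4-bit fields into bytes.
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0f
	// Add all bytes together; the total lands in the top byte.
	return int((x * 0x0101010101010101) >> 56)
}

// popcountBuiltin defers to the standard library.
func popcountBuiltin(x uint64) int {
	return bits.OnesCount64(x)
}

func main() {
	for _, v := range []uint64{0, 1, 0xf0, 0xdeadbeef, ^uint64(0)} {
		fmt.Println(v, popcountPortable(v), popcountBuiltin(v))
	}
}
```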
cperciva 6 hours ago
> arguably the whole scheme should be replaced by finer-grained feature detection.
This seems like a strange thing to say. Fine grained feature detection was around long before "microarchitecture levels" and never went away. The microarchitecture levels were introduced because they were easier to use.
vintagedave 3 hours ago
This measurement is very focused on bit-related instructions. A few months ago we did some work on our (RemObjects) toolchain, and it showed very similar results, though our published benchmarks were for floating point.[0] I did some rough internal measurements showing the same for integer-related instructions too.
The same conclusion: v2 as baseline, v3 where possible.
I'm really surprised it's not standard in every toolchain to support arch levels like this today.
Some compilers like Clang allow multiple arch versions in one binary, runtime dispatched. I would love to implement this in our toolchain too.
[0] Please forgive the SEO-style title, it's, well, to get search engines to recognise what's in the article: https://blogs.remobjects.com/2026/01/26/fast-math-in-six-lan...
deathanatos 5 hours ago
> That is a 43% reduction, and it is free: no source change, just a compiler flag.
It's not entirely free; the cost is that the resulting binary will no longer run on processors that lack the instruction. Which, admittedly, is ≈2007 or older. But still! I have a 2012 CPU still in service, and as much as I'd love to obsolete it, gestures at the price tag of RAM these days.
… a 2012 CPU is surprisingly competitive relative to today's tech, too, I'd add. The gap between 2012 and 2026 is nothing compared to the equivalent gap between 1998 and 2012: 1998 is like 500MHz single-core, 32-bit. 2012 is 4 core, 8 hyper threads, 64-bit, 3.5 GHz. (… perhaps more remarkably, my next-oldest machine, a 2017 laptop, is only 2.8 GHz, with the same 4(/8) cores. It also uses like half the power, too. That's mostly the "laptop" bit, though.)
(That same CPU is also incapable of "v3".)
- tgv 5 hours ago
  My main problem was that our hosting company offers cheap Linux servers, but with a shared CPU that doesn't even support v2. We pay more now, but you could still run into that problem.
  - fweimer 4 hours ago
    It's likely this is a hypervisor misconfiguration. Either way, one has to wonder what kind of mitigations for cross-tenant leakage they are missing.
    - tgv 3 hours ago
      Or on purpose, because the CPUs with AVX are more expensive. Or historical: the hardware for this kind of service may have been old, and you can't tell people that if you buy today you get a processor with AVX, but tomorrow you may get one without. I haven't checked if they upgraded their low-cost options in a while.
    - rob74 3 hours ago
      Either it's a misconfiguration, or it's intentional (only providing a "bare-bones" machine for the lowest price level, even if the underlying hardware would support more)?
GianFabien 5 hours ago
I think the more critical question is how well compiler writers can update the heuristics which identify the instruction sequences that benefit from the architectural features. Last I looked, Intel has several thousand intrinsics which must be explicitly invoked to make use of specific features.
I suspect that heavily optimised code either uses intrinsics or carefully written assembler code.
- fweimer 4 hours ago
  Newer (relatively speaking) x86-64 instruction sets support many three-operand instructions, which are actually easier to use for compilers than instructions with overwritten source operands or hard register constraints. Pattern matching for instructions that do not have a direct C representation (such as NAND) is also pretty standard in compilers. Auto-vectorization is more tricky (especially when you want code to actually run faster …), but some of the new ISAs are impactful without it. And of course there are expanders for fixed-size memcpy and memset that can use wider vector instructions quite easily. Those operations are quite common.
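As a sketch of the kind of source patterns meant here, the Go snippet below writes a few bit idioms in plain operators that a compiler can, in principle, recognise and map to single BMI1-style instructions. I am assuming "NAND" above refers to the and-not operation, and the instruction names in the comments are candidates only: what is actually emitted depends on the compiler version and the GOAMD64 level. The function names are made up for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// andNot clears in x every bit that is set in mask
// (a candidate for ANDN).
func andNot(x, mask uint64) uint64 {
	return x &^ mask
}

// lowestSetBit isolates the lowest set bit of x
// (a candidate for BLSI).
func lowestSetBit(x uint64) uint64 {
	return x & -x
}

// clearLowestSetBit resets the lowest set bit of x
// (a candidate for BLSR).
func clearLowestSetBit(x uint64) uint64 {
	return x & (x - 1)
}

// trailingZeros counts the zero bits below the lowest set bit
// (a candidate for TZCNT); it returns 64 for x == 0.
func trailingZeros(x uint64) int {
	return bits.TrailingZeros64(x)
}

func main() {
	const x = 0b101000
	fmt.Printf("%b %b %b %d\n",
		andNot(0b1111, 0b0101), lowestSetBit(x), clearLowestSetBit(x), trailingZeros(x))
}
```

All four functions produce the same results at every microarchitecture level; only the instruction selection can change.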
- wahern 5 hours ago
  I think both AMD and Intel employ and/or fund GCC and LLVM developers to add support for each new architecture. Compiler and product release schedules are independent so the target and tuning support in the latest compiler release may be slightly behind or even ahead of the latest microarchitecture release. GCC 16.1 has support for Zen 6, which hasn't even been released yet. (https://gcc.gnu.org/gcc-16/changes.html#x86)
nevi-me 5 hours ago
Does Docker have uarch level support? I think similar to arch level, it could be beneficial being able to pull a v4 image.
Ubuntu started allowing defaulting to v3 packages, and I opted in. I already use the -C native to enable AVX512 when compiling binaries for local use. This matters a lot for compute/analytics workloads in my experience.
- tuetuopay 1 hour ago
  Yes, it's in the platform options. You can specify --platform linux/amd64/v3 for a v3 image.
kristianp 6 hours ago
I'm surprised that Go doesn't default to AVX2 support by now, considering that Haswell started shipping in mid 2013.
Speaking of Dr Lemire's suggestion of a V5 architecture level, would that make any sense given the fragmentation of AVX512? It is absent from Intel consumer devices, but present on the last few generations of AMD.
- adrian_b 2 hours ago
  The fragmentation of AVX-512 is a legacy problem.
  Every CPU introduced after Ice Lake (Q3 2019) that supports any kind of AVX-512 also supports all the AVX-512 subsets of Ice Lake (which has very important additions over V4). The one exception is Cooper Lake (Q2 2020), a server CPU with a modest installed base.
  This includes all AMD Zen 4, Zen 5 and Zen 6 CPUs, which form the bulk of the non-server CPUs that support AVX-512. Thus 6 years have passed since the introduction of an AVX-512 CPU that is not compatible with Ice Lake (and 7 years since any such CPU that was in widespread use).
  Both Intel and AMD have stated that from now on features will be added to AVX-512 (a.k.a. AVX10), not deleted, which will allow in the future the testing of the AVX10 version number to be sufficient for determining CPU capability in this domain.
  It would make sense to define a V5 level that includes all instructions of Ice Lake and also a V6 level, corresponding to AVX10.1 (Intel Granite Rapids) or to AVX10.2 (Intel Diamond Rapids).
- Am4TIfIsER0ppos 2 hours ago
  Here is a CPU from 2020 that supports neither AVX nor AVX2: https://www.intel.com/content/www/us/en/products/sku/199288/... It is very low budget and probably not common, but this is why one might choose not to require AVX2.
pixelpoet 1 hour ago
These slop images he uses for his articles are so bad; here it says "accelation"...
jeffrallen 6 hours ago
This is one of the clearest examples of diminishing returns I've ever seen. It comes up everywhere.
I wonder if this is a natural law, or emergent behavior of complex systems?
- ncruces 5 hours ago
  The last level is simply unused at this time.
  https://go.dev/wiki/MinimumRequirements#:~:text=The%20Go%20t...
- adrian_b 2 hours ago
  Diminishing returns apply to ISA extensions only on average, not for individual applications.
  Most of the recent additions in processor instruction sets are intended for relatively niche applications.
  In such cases, other applications will not be affected at all, but the specific application that is the target, for example a certain cryptographic algorithm or AI inference, may be accelerated many times when using the new ISA version instead of the old ISA version.
  Moreover, compilers are frequently not smart enough to take advantage of such ISA extensions, so changing the compilation flags is not enough; you need to rewrite some library to get the full performance benefit. For example, many recent x86_64 CPUs have IFMA instructions (integer fused multiply-add instructions), which allow the use of the floating-point multipliers for doing arithmetic operations with big integer numbers (the advantage is that modern CPUs have many more FP multipliers than integer multipliers). This can greatly accelerate computations with big numbers, but you need a complete, carefully written library that uses such instructions; you cannot just recompile some programs to make them run faster.
  From time to time it may still happen that some ISA extension has a wider applicability, being able to accelerate many applications, possibly just by recompilation, as Intel hopes will happen with the APX extension that will arrive early next year, in the Intel Nova Lake and Diamond Rapids CPUs.
  Most non-professional computer users are biased toward single-threaded application performance, where diminishing returns have already been seen for more than 2 decades.
  On the other hand, for multi-threaded application throughput, we have not yet reached any diminishing returns. The throughput per CPU socket has continued to increase in geometric progression every year until now. The only serious problem is that starting around 10 years ago, from the days of Intel Kaby Lake and Coffee Lake, the price of computers has started to increase and the increase rate has accelerated recently.
  So now the possible throughput for a given computer size becomes less and less relevant in comparison with the throughput per dollar, and for the throughput per dollar it appears that we have already entered the region of diminishing returns (i.e. with unlimited budget you can still buy computers whose throughput increases in geometric progression each year, but the computers that you can actually still afford have a throughput that increases much more slowly).
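To make the big-number point above concrete, here is a scalar Go sketch of the multiply-accumulate step that multi-precision libraries are built around. It does not use IFMA (it goes through the ordinary integer multiplier via `math/bits`), and the names `mulAddLimb` and `mulAddVec` are made up for illustration; an IFMA-based library would restructure this inner loop around 52-bit limbs in vector registers, which is the hand-written work a recompile does not do for you.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulAddLimb returns a*b + c + carryIn as a 128-bit value (hi, lo).
// The sum cannot overflow 128 bits, so hi never wraps.
func mulAddLimb(a, b, c, carryIn uint64) (hi, lo uint64) {
	hi, lo = bits.Mul64(a, b)
	var carry uint64
	lo, carry = bits.Add64(lo, c, 0)
	hi += carry
	lo, carry = bits.Add64(lo, carryIn, 0)
	hi += carry
	return hi, lo
}

// mulAddVec computes dst += src * m over little-endian 64-bit limbs
// and returns the carry out of the top limb.
// It assumes len(dst) >= len(src).
func mulAddVec(dst, src []uint64, m uint64) uint64 {
	var carry uint64
	for i := range src {
		carry, dst[i] = mulAddLimb(src[i], m, dst[i], carry)
	}
	return carry
}

func main() {
	// (2^128 - 1) * 2, accumulated into a zeroed destination.
	src := []uint64{^uint64(0), ^uint64(0)}
	dst := make([]uint64, 2)
	carry := mulAddVec(dst, src, 2)
	fmt.Printf("carry=%d limbs=%#x\n", carry, dst)
}
```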
andrewstuart 6 hours ago
I would have thought you’d need to explicitly code to match the cpu capabilities to your application, for maximum benefit.
- adrian_b 2 hours ago
  This is what you should always do when you know on which computers you will run the program, like I do when writing programs for my own computers.
  If you are a software vendor, or even just a contributor to some open-source program, you must make some compromise between program performance and its ability to run without modifications on as large a number of computers as possible.
  Therefore you must either avoid any features available only in newer computers, or you must have some kind of processor capability detection at run time, followed by the selection of appropriate program variants.
  You might not be able to afford to prepare enough program variants, so it is likely that you would still choose not to support the most recent computers.
pjmlp 6 hours ago
Nothing, because this is a compiler question, not a language one.
- tgv 5 hours ago
  One of Go's selling points is performance. Another is easy deployment on a lot of platforms. This post is interesting from that perspective.
  Edit: to address your literal remark: so even the title is correct, if you think of a programming language as more than its syntax.
  - pjmlp 4 hours ago
    Language !== Implementation, so no, the title isn't correct.
    Go's selling point is definitely not performance.
    - arghwhat 46 minutes ago
      Go's selling point is most definitely performance, but relative to implementation effort of a given application. This is opposed to languages that focus more on maximum performance at any cost to implementation, or maximum convenience at any cost to performance.
      - pjmlp 27 minutes ago
        Basically a political answer that answers nothing.
        It isn't compilation speed, as that is only surprising to those who never used 1990s compiled languages like Modula-2, Object Pascal, Clipper and co.
        It isn't performance of code execution, as even GCCGO could beat the reference implementation, unfortunately now stagnant since no one cares to update it beyond Go 1.18.
        And to go back to the article, as pointed out there,
        > The Go toolchain does not currently generate any AVX512 instructions.
        Thus leaving performance on the table.