The state of SIMD in Rust in 2025

(shnatsel.medium.com)

212 points | by ashvardanian 17 hours ago

14 comments

  • silentvoice 8 minutes ago
    oh boy I've got opinions here.

    Basically I just don't want to hear about "the state of SIMD in Rust" unless it is about dramatic improvement in autovectorization in the rust compiler.

    80%-90% or so of real life vectorization can be achieved in C or C++ just by writing code in a way that it can be autovectorized. Intrinsics get you the rest of the way on harder code. Autovectorization is essentially a solved problem for the vast majority of floating point code.

    Not so with Rust, because of a dogmatic approach to floating point arithmetic that assumes bitwise reproducibility is the "right" answer for everyone (actually, it's the right answer for almost nobody), to the point of not even letting a user opt in to these optimizations. And once you get to the point of writing intrinsics, you have to handwrite code for every new architecture, when autovectorizers could have gotten you 80%-90% of the way there from a single source, and often that is just enough.

    The contention with the above is that if a user needs SIMD they can just use some SIMD API and make their intention more clear. This is essentially an argument that we should handwrite intrinsics. Well, guess what: I'm a programmer and I use compilers because they _do this for me_, and indeed they are able to do so very easily in C or C++ when I instruct them that I'm ok with reordering operations and other "accuracy impacting" optimizations.

    The huge joke on us is that these optimizations generally have the effect of _improving_ accuracy, because they reduce the number of rounding steps, either by simply reducing the number of operations or by using fused multiply-adds, which round only once.
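
    (As an aside, Rust does expose the single-rounding FMA explicitly on stable via `mul_add`; a minimal sketch of the difference, purely for illustration:)

      fn dot(a: &[f32], b: &[f32]) -> f32 {
          a.iter()
              .zip(b)
              // x.mul_add(y, acc) computes x*y + acc with one rounding step;
              // x * y + acc rounds after the multiply and again after the add.
              .fold(0.0f32, |acc, (x, y)| x.mul_add(*y, acc))
      }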

  • kouteiheika 1 hour ago
    Unfortunately SIMD in Rust tends to be pretty painful if you want to gracefully do runtime autodetection of a given SIMD extension (instead of it being a hard requirement for your program to even run).

    The major problem is that Rust essentially requires you to annotate every (!) function in your whole call stack with e.g. `#[target_feature(enable = "avx2")]` to make sure that the SIMD intrinsics will actually get inlined (if they're not inlined then the performance is horrible, which makes using SIMD completely pointless). This makes it very hard to build any reasonable abstractions because you need to hardcode this all over your code. You can't have e.g. a `DataStructure<S>` where S is the SIMD ISA, so that you could do `DataStructure<AVX2>` or `DataStructure<SSE>` to get a nicely specialized version of it for a given instruction set. You need to copy-paste the whole thing with changed `target_feature` attributes (or use a procedural macro which does the copy-pasting) and have two entirely separate `DataStructureAVX2` and `DataStructureSSE` types.
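
    A minimal sketch of the pattern being described (illustrative function names, not from any real crate): every function on the hot path carries the attribute, and only the outermost entry point does the runtime check:

      #[target_feature(enable = "avx2")]
      unsafe fn kernel_avx2(data: &[f32]) -> f32 {
          // imagine intrinsics or specialized SIMD code here
          data.iter().sum()
      }

      // Every caller on the hot path needs the same annotation, otherwise
      // `kernel_avx2` sits behind a non-inlined call and performance tanks.
      #[target_feature(enable = "avx2")]
      unsafe fn pipeline_avx2(data: &[f32]) -> f32 {
          unsafe { kernel_avx2(data) * 0.5 }
      }

      fn pipeline(data: &[f32]) -> f32 {
          if is_x86_feature_detected!("avx2") {
              // Safe: the feature was just checked at runtime.
              unsafe { pipeline_avx2(data) }
          } else {
              data.iter().sum::<f32>() * 0.5
          }
      }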

  • andyferris 14 hours ago
    Regarding autovectorization:

    > The other drawback of this method is that the optimizer won’t even touch anything involving floats (f32 and f64 types). It’s not permitted to change any observable outputs of the program, and reordering float operations may alter the result due to precision loss. (There is a way to tell the compiler not to worry about precision loss, but it’s currently nightly-only).

    Ah - this makes a lot of sense. I've had zero trouble getting excellent performance out of Julia using autovectorization (from LLVM) so I was wondering why this was such a "thing" in Rust. I wonder if that nightly feature is a per-crate setting or what?

    • CryZe 3 minutes ago
      There are algebraic operations available on nightly: https://doc.rust-lang.org/nightly/std/primitive.f32.html#alg...
    • Sharlin 50 minutes ago
      LLVM autovectorizes many FP operations just fine, the article was a bit strange in that respect. Problem is, there are many other cases where it's unable to do so, not because it can't but because it isn't allowed.
      • exDM69 36 minutes ago
        In my experience, compiling C with -ffast-math will tremendously improve floating point autovectorization and optimizations to SIMD (C vector extensions, which are similar to Rust std::simd) code in general.

        This obviously has a lot of caveats, and should only be enabled on a per function or per file basis.

        Unfortunately Rust does not currently have options for adjusting per-function compiler optimization parameters. This is possible in some C compilers using function attributes.

    • vlovich123 13 hours ago
      Does Julia ignore the problem of floating point not being associative, commutative nor distributive?

      The reason it’s a thing is from LLVM and I’m not sure you can “language design” your way out of this problem as it seems intrinsic to IEEE 754.

      • ChrisRackauckas 12 hours ago
        No, it only uses the same LLVM compiler passes, and you enable certain optimizations locally via macros if you want to allow reordering in a given expression.
      • tomsmeding 13 hours ago
        Nitpick, but IEEE float operations are commutative (when relevant and appropriate). Associative and distributive they indeed are not.
        • vlovich123 13 hours ago
          Unless I’m having a brain fart it’s not commutative or you mean something by “relevant and appropriate” that I’m not understanding.

          a+b+c != c+b+a

          That’s why you need techniques like Kahan summation.
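
          (For reference, a minimal sketch of Kahan compensated summation; this is the textbook algorithm, nothing Rust-specific:)

            fn kahan_sum(xs: &[f64]) -> f64 {
                let mut sum = 0.0;
                let mut c = 0.0; // running compensation for lost low-order bits
                for &x in xs {
                    let y = x - c;
                    let t = sum + y;
                    c = (t - sum) - y; // recovers the part of y that didn't make it into t
                    sum = t;
                }
                sum
            }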

          • amluto 10 hours ago
            I think the other replies are overcomplicating this.

            + is a binary operation, and a+b+c can’t be interpreted without knowing whether one treats + as left-associative or right-associative. Let’s assume the former: a+b+c really means (a+b)+c.

            If + is commutative, you can turn (a+b)+c into (b+a)+c or c+(a+b) or (commuting twice) c+(b+a).

            But that last expression is not the same thing as (c+b)+a. Getting there requires associativity, and floating point addition is not associative.

          • wtallis 13 hours ago
            "a+b+c" doesn't describe a unique evaluation order. You need some parentheses to disambiguate which changes are due to associativity vs commutativity. a+(b+c)=(c+b)+a should be true of floating point numbers, due to commutativity. a+(b+c)=(a+b)+c may fail due to the lack of associativity.
            • adastra22 13 hours ago
              It is not, due to precision. Consider a=1.00000, b=-0.99999, and c=0.00000582618.
              • jcranmer 12 hours ago
                No, the two evaluations will give you exactly the same result: https://play.rust-lang.org/?version=stable&mode=debug&editio...

                IEEE 754 operations are nonassociative, but they are commutative (at least if you ignore the effect of NaN payloads).
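
                (A quick self-contained way to check this, along the same lines as the playground link:)

                  fn main() {
                      let (a, b, c) = (1.00000_f64, -0.99999, 0.00000582618);
                      // Pure commutativity: same grouping, operands swapped -> identical result.
                      assert_eq!(a + (b + c), (c + b) + a);
                      // Associativity is what can fail: different groupings may differ.
                      println!("{:e} vs {:e}", (a + b) + c, a + (b + c));
                  }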

                • dbdr 4 hours ago
                  Is there a case involving NaN where they are not commutative? Do you mean getting a different bit-level representation of NaN?
                  • Remnant44 3 hours ago
                    In practical use for SIMD: various min/max operations. On Intel at least, they propagate NaN or not based on operand order.
                • imtringued 2 hours ago
                  https://play.rust-lang.org/?version=stable&mode=debug&editio...

                  You're supposed to do (a+b) first to demonstrate the effect, because a floating point subtraction that results in a number near zero is sensitive to rounding (worst case, a non-zero value rounds to zero), which can introduce a huge error when a and b are very close in magnitude.

              • zygentoma 12 hours ago
                You still need to specify an evaluation order …
              • immibis 12 hours ago
                Does (1.00000+-0.99999)+0.00000582618 != 0.00000582618+(-0.99999+1.00000) ? This would disprove commutativity. But I think they're equal.
          • dataangel 9 hours ago
            For those to be equal you need both associativity and commutativity.

            Commutativity says that a*b = b*a, but that's not enough to allow arbitrary reordering. When you write a*b*c depending on whether * is left or right associative that either means a*(b*c) or (a*b)*c. If those are equal we say the operation is associative. You need both to allow arbitrary reordering. If an operation is only commutative you can turn a*(b*c) into a*(c*b) or (b*c)*a but there is no way to put a in the middle.

          • wfleming 12 hours ago
            We’re in very nitpicky terminology weeds here (and I’m not the person you’re replying to), but my understanding is “commutative” is specifically about reordering operands of one binary op (4+3 == 3+4), while “associative” is about reordering a longer chain of the same operation (1+2+3 == 1+3+2).

            Edit: Wikipedia actually says associativity is definitionally about changing parens[0]. Mostly amounts to the same thing for standard arithmetic operators, but it’s an interesting distinction.

            [0]: https://en.wikipedia.org/wiki/Associative_property

            • nyrikki 12 hours ago
              It is not a nit, it is fundamental: a•b•c is about associativity, specifically operator associativity.

              Rounding and eventual underflow in IEEE mean that an expression X•Y for any algebraic operation • produces, if finite, a result (X•Y)·(1 + β) + μ, where |μ| cannot exceed half the smallest gap between numbers in the destination's format, |β| < 2^-N, and β·μ = 0 (μ ≠ 0 only when underflow occurs).

              And yes that is a binary relation only

              a•b•c is really (a•b)•c assuming left operator associativity, one of the properties that IEEE doesn't have.

          • nyrikki 13 hours ago
            IEEE 754 floating-point addition and multiplication are commutative in practice, even if there are exceptions with NaNs etc..

            But remember that commutative is on the operations (+,x) which are binary operations, a+b=b+a and ab=ba, you can get accumulated rounding errors on iterated forms of those binary operations.

    • Arch-TK 14 hours ago
      It's not something you seem to be able to just enable globally. From what I gather this is what is being referenced:

      https://doc.rust-lang.org/std/intrinsics/index.html

      Specifically the *_fast intrinsics.

      • ladyanita22 4 hours ago
        Is this equivalent to -ffast-math?
        • Arch-TK 3 hours ago
          From what I know of -ffast-math and what I can read in the docs for *_fast, I am not convinced that the *_fast intrinsics do _everything_ -ffast-math allows. They seem focused on algebraic equivalences (a/b is equivalent to a*(1/b)) and assumptions of finite math. There are a few other things that -ffast-math allows, like ignoring certain errors, ignoring the existence of signed zero, ignoring signalling NaN handling, ignoring SIGFPE handling, etc...
          • Sharlin 48 minutes ago
            Yes, because many of the traditional "fast math" assumptions are definitely not something that should be hidden behind an attractive option like that. In particular assuming the nonexistence of NaNs is essentially never anything but a ticket to the UB land.
    • dzaima 12 hours ago
      For vectorizing, that quote is only true for loops with dependencies between iterations, e.g. summing a list of numbers (..that's basically the only case where this really matters).

      For loops without such dependencies Rust should autovectorize just fine as with any other element type.

      • galangalalgol 9 hours ago
        You just create f32x4 types, the wide crate does this. Then it autovectorizes just fine. But it still isn't the best idea if you are comparing values. We had a defect due to this recently.
        • the__alchemist 8 hours ago
          I suspect I am misunderstanding. If you create an f32x4 type, aren't you manually vectorizing? Auto-vectorization is magic SIMD use that the compiler does in some cases. (But usually doesn't...)
          • galangalalgol 3 hours ago
            You are manually vectorizing, but it lets the optimizer know you don't care about safe rounding behavior so it ends up using the simd instructions. And this way it is portable still vs using intrinsics. Floating point addition is the only one the optimizer isn't allowed to do, so if you just need multiplication or only use integers it all autovectorizes fine. The f32xN stuff is just a way to tell it you don't care about the rounding. There are better ways to do that that could be added, like a FastF32 type, but I don't know if llvm could support that.

            Edit: go to godbolt and load the Rust aligned-sum example and play around with types. If you see addps, that is the packed single-precision SIMD add instruction. The more you get packed, the higher your score! You'll need to pass some extra arguments they don't list to get AVX-512 sized registers instead of the xmm or ymm ones. And not all the instances it uses support AVX-512, so sometimes you have to try a couple of times.

            • dzaima 2 hours ago
              Well, not really "you don't care about safe rounding behavior", more just "you have specified a specific operation order that happens to be more susceptible to being vectorizable". Implementing a float sum that way has the completely-safe completely-well-defined portable behavior of summing strides for any given size.
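
              (Roughly this, in plain Rust with an array standing in for f32x8 - fixing the accumulation order into independent lanes is exactly what makes the reduction vectorizable:)

                const LANES: usize = 8;

                fn lane_sum(xs: &[f32]) -> f32 {
                    let mut acc = [0.0f32; LANES];
                    let mut chunks = xs.chunks_exact(LANES);
                    for chunk in &mut chunks {
                        for i in 0..LANES {
                            // independent per-lane accumulators, so packed adds are legal
                            acc[i] += chunk[i];
                        }
                    }
                    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
                }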

              Both float multiplication and float addition are equally bad for optimizations though - both are non-associative: https://play.rust-lang.org/?version=stable&mode=debug&editio... ; and indeed changing the aligned-sum example to f64, neither .sum() nor .product() get vectorized.

              And e.g. here's a plain rust loop autovectorizing both addition and multiplication (though of course not a reduction): https://rust.godbolt.org/z/6hEcj8zfx

              • galangalalgol 54 minutes ago
                I meant that multiplying two vectors pointwise autovectorizes, because there is no ordering involved. I'm usually doing accumulated products or something like them for DSP. As long as you only use the wide types it is fine. I had a bug when comparing values constructed partially from SIMD vs. not at all. Very unusual, I'm sure, but there really is a reason Rust won't let you turn on ffast-math.
    • bobmcnamara 13 hours ago
      We used to tweak our scalar product simulator code to match the SIMD arithmetic order so we could hash the outputs for tests.

      I wonder if it could autovec the simd-ordered code.

    • queuebert 8 hours ago
      Does Rust not have the equivalent of GCC's "-ffast-math"?
      • Sharlin 39 minutes ago
        No, because as I commented in another subthread, `-ffast-math` is:

        1. dangerous assumptions hidden behind a simple, attractive-looking option [1]. It should be called -fwrong-math or -fdangerous-math or something (GCC does have the funnily named switch -funsafe-math-optimizations – what could go wrong with fun, safe math optimizations?!)

        2. Translation-unit scoped, which means that dependencies that haven't consented to "fast math" can break your code (as in UB land) or make the optimizations pointless, and your code can break your dependencies' semantics too via inlining. On the other hand, a library author must think very carefully about which float opts to enable in order to stay compatible with client code.

        Deciding how the scoping of non-IEEE float math operations should work is a very nontrivial question. The scope could be a translation unit, a module, a type, a function, a block, or every individual operation, and none of those is without issues, particularly regarding questions like inlining, interprocedural and link-time optimization, and ergonomics. In some ways, it's yet another function coloring problem.

        There are currently-unstable "algebraic_add/mul/etc" methods for floats for letting LLVM treat those particular operations as if floats were real numbers [2]. They're the first step towards safe UB-free float optimizations, but of course those names are rather awkward to use in math-heavy code, and a wrapper type overloading the normal operators would be good to have.

        ---

        [1] See, eg. https://simonbyrne.github.io/notes/fastmath/

        [2] In terms of associativity and such, not in eg. assuming the nonexistence of NaNs, which would be very unsafe.
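
        To make that last paragraph concrete, a rough sketch of such a wrapper on nightly (assuming the unstable float_algebraic feature and the method names from [2]; the `Alg` newtype here is hypothetical):

          #![feature(float_algebraic)]
          use std::ops::Add;

          #[derive(Clone, Copy)]
          struct Alg(f32); // wrapper opting its operators into "treat floats as reals"

          impl Add for Alg {
              type Output = Alg;
              fn add(self, rhs: Alg) -> Alg {
                  // algebraic_add lets LLVM reassociate (and thus vectorize) the addition
                  Alg(self.0.algebraic_add(rhs.0))
              }
          }

          fn sum(xs: &[f32]) -> f32 {
              xs.iter().copied().map(Alg).fold(Alg(0.0), |a, b| a + b).0
          }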

      • demurgos 48 minutes ago
        No it doesn't. A global flag is a no-go as it breaks modularity. A local opt-in through dedicated types or methods is being designed but it's not stable.
  • bencyoung 16 hours ago
    Odd that C# has a better stable SIMD story than Rust! It has both generic vector types across a range of sizes and a good set of intrinsics across most of the common instruction sets.
    • kelnos 16 hours ago
      Why would that be odd? C# is an older and mature language backed by a corporation, while Rust is younger and has been run by a small group of volunteers for years now.
      • josefx 8 hours ago
        > while Rust is younger and has been run by a small group of volunteers for years now

        I thought Rust was getting financial support from Google, Microsoft and Mozilla? Or was the Rust Foundation just a convenient way for Mozilla to fire a large number of developers, and are we actually rapidly approaching the OpenSSL Heartbleed state, where everyone is happily building on a secure foundation that is maintained by a half-dead intern when he isn't busy begging for scraps on the street?

        • testdelacc1 6 hours ago
          Mozilla hasn’t supported development on the Rust project for about 5 years now, since laying off all the developers working on it in August 2020.

          Since then several Rust project developers did find full time jobs in companies like Amazon, Meta etc. But corporate interest ebbs and flows. For example Huawei employed a couple of engineers to improve the compiler for several years but they lost interest a couple of months ago.

          The Rust Foundation is powered by donations, but a lot of its expenses go to funding infrastructure, security, and legal costs. Still, the problem of funding the project's maintainers is on their radar. Yesterday they started an initiative to fundraise for the Maintainers Fund, with the money going straight to maintainers who aren’t being paid by their employer to work on Rust full time. (https://rustfoundation.org/media/announcing-the-rust-foundat...)

      • bencyoung 2 hours ago
        Not majorly odd, just an area I thought Rust would be hot on when it comes to performance...
      • booi 15 hours ago
        not just any corporation.. the largest software corporation on the planet
        • Arch-TK 14 hours ago
          not just any largest software corporation, one of my two least favourite largest software corporations on the planet.
          • hu3 12 hours ago
            not just any least favourite largest software corporation of yours...

            the one that most contributes to open source from the largest corporations. so one of my favourites because of that

            they were also one of the first of the large corps to show interest in Rust

    • exyi 16 hours ago
      C# portable SIMD is very nice indeed, but it's also not usable without unsafety. On the other hand, the Rust compiler (LLVM) has a fairly competent autovectorizer, so you may be able to simply write loops the right way instead of using the fancy API.
      • bencyoung 2 hours ago
        Having worked in HPC a fair bit, I'm not a fan of autovectorization. I prefer the compiled code's performance to be "unsurprising" based on the source, and to use vectors etc. where I know they're needed. I think in general it's better to have linting that points out performance issues (e.g. lift this outside the loop) rather than have compilers do it automatically and make things less predictable.
      • buybackoff 14 hours ago
        Unsafety means different things. In C#, SIMD is possible via `ref`s, which maintains GC safety (no GC holes) but removes bounds safety (the array length check). The API is appropriately called Vector.LoadUnsafe.
      • neonsunset 15 hours ago
        You are not "forced" into unsafe APIs with Vector<T>/Vector128/256/512<T>. While unsafe is a nice improvement and helps with achieving completely optimal compiler output, you can use them without it. For example, ZLinq even offers an .AsVectorizable LINQ-style API, where you pass lambdas which handle vectors and scalars separately. The user code cannot go out of bounds, and the resulting logic even goes through (later JIT-inlined) delegates, yet it still offers a massive speed-up (https://github.com/Cysharp/ZLinq?tab=readme-ov-file#vectoriz...).

        Another example, note how these implementations, one in unsafe C# and another in safe F# have almost identical performance: https://benchmarksgame-team.pages.debian.net/benchmarksgame/..., https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

    • jiehong 16 hours ago
      C# is blessed on that front. Java’s SIMD state is still sad, and golang is not as great either.
      • ashf023 16 hours ago
        Yeah, golang is a particular nightmare for SIMD. You have to write plan 9 assembly, look up what they renamed every instruction to, and then sometimes find that the compiler doesn't actually support that instruction, even though it's part of an ISA they broadly support. Go assembly functions are also not allowed to use the register-based calling convention, so all arguments are passed on the stack, and the compiler will never inline it. So without compiler support I don't believe there's any way to do something like intrinsics even. Fortunately compiler support for intrinsics seems to be on its way! https://github.com/golang/go/issues/73787
        • Thaxll 13 hours ago
          Go has been using register based calling for a while now?
          • immibis 12 hours ago
            GP comment said it's not used for FFI, not that it's not used.
      • soupy-soup 11 hours ago
        To be fair, Java's lack of support seems to have more to do with them needing to fix the whole primitive vs object mess rather than a lack of effort. It sounds like the Vector API will be stabilized shortly after they figure that out, but who knows how long it will take.
      • pjmlp 3 hours ago
        While it is blocked on Valhalla, it is quite usable. If folks use nightly all the time with Rust, what is the problem with --preview?
    • fulafel 4 hours ago
      How much of this is due to use in games and Mono?

      Eg https://tirania.org/blog/archive/2008/Nov-03.html

  • josephg 17 hours ago
    Why isn’t std::simd stable yet? Why do so many great features seem stuck in the same nightly-forever limbo land - like generators?

    I’m sure more people than ever are working on the compiler. What’s going on?

    • ChadNauseam 16 hours ago
      There really aren't that many people working on the compiler. It's mostly volunteers.

      The structure is unlike a traditional company. In a traditional company, the managers decide the priorities and direct the employees what to work on while facilitating that work. While there are people in more managerial-type positions working on the Rust compiler, their job is not to tell the volunteers what to work on (they cannot), but instead to help the volunteers accomplish whatever it is they want to do.

      I don't know about std::simd specifically, but for many features, it's simply a case of "none of the very small number of people working on the rust compiler have prioritized it".

      I do wish there was a bounty system, where people could say "I really want std::simd so I'll pay $5,000 to the rust foundation if it gets stabilized". If enough people did that I'm sure they could find a way to make it happen. But I think realistically, very few people would be willing to put up even a cent for the features they want. I hear a lot of people wishing for better const generics, but only 27 people have set up a donation to boxy (lead of the const generics group https://github.com/sponsors/BoxyUwU ).

      • LtWorf 13 hours ago
        > There really aren't that many people working on the compiler. It's mostly volunteers.

        Seems smart to put the language as a requirement for compiling the linux kernel and a bunch of other core projects then!

        • ChadNauseam 13 hours ago
          I think it seems just right. Languages these days are either controlled by volunteers or megacorps. Because linux is about freedom and is not aligned with megacorps, I think they'd prefer a volunteer-driven language like Rust or C++ rather than the corporate ones.
          • pjmlp 3 hours ago
            C++ has been an industrial language from the early days, since it got adopted among C compiler vendors back in the 1980's.
          • FuckButtons 13 hours ago
            I’m not sure you can argue that Rust and C++ have anything like a similar story around being volunteer oriented, given the number of places that have C++ compiler groups that contribute papers / implementations.
          • immibis 12 hours ago
            I'm not sure you can claim that Linux is about freedom. Linux is run by a bunch of corps and megacorps who are otherwise competing, not by volunteers.
        • lmm 11 hours ago
          I mean, Linux development works exactly the same way.
          • pclmulqdq 8 hours ago
            Linux has a BDFL and a quasi-corporate structure that keeps all the incentives aligned. Rust has neither of those.
          • LtWorf 5 hours ago
            I think it's quite rare for linux developers to not do it on behalf of some company.

            Weren't a bunch of modules deprecated recently as a consequence of intel layoffs?

            • lmm 5 hours ago
              > I think it's quite rare for linux developers to not do it on behalf of some company.

              Corporate-sponsored contributions are probably the majority, but I don't think true volunteers are super-rare. But in both cases they're a "volunteer" from the perspective of the Linux leadership - they're contributing the changes that they want to make, they're not employees of Linux who can be directed to work on the things that that leadership thinks is important.

              (And conversely it's the same with Rust - a lot of those volunteer contributors work for employers who want Rust to have some specific functionality, so they employ someone to work on that)

    • JoshTriplett 15 hours ago
      > Why isn’t std::simd in stable yet?

      Leaving aside any specific blockers:

      - It's a massive hard problem, to build a portable abstraction layer over the SIMD capabilities of various CPUs.

      - It's a massive balance between performance and usability, and people care deeply about both.

      - It's subject to Rust's stability guarantee for the standard library: once we ship it, we can't fix any API issues.

      - There are already portable SIMD libraries in the ecosystem, which aren't subject to that stability guarantee as they can ship new semver-major versions. (One of these days, I hope we have ways to do that for the standard library.)

      - Many people already use non-portable SIMD for the 1-3 targets they care about, instead.

      • exDM69 25 minutes ago
        Despite all of these issues you mention, std::simd is perfectly usable in the state it is in today in nightly Rust.

        I've written thousands and thousands of lines of Rust SIMD code over the last ~4 years and it's, in my opinion, a pretty nice way of doing SIMD code that is portable.

        I don't know about the specific issues in stabilization, but the API has been relatively stable, although there were some breaking changes a few years ago.

        Maybe you can't extract 100% of your CPUs capabilities using it, but I don't find that a problem because there's a zero-cost fallback to CPU-specific intrinsics when necessary.

        I recently wrote some computer graphics code and I could get really nice performance (~20x my scalar code, 5x from just a naive translation). And the same codebase can be compiled to AVX2, SSE2 and ARM NEON. It uses f32x8's (256b vector width), which are not available on SSE or NEON, but the compiler can split those vectors. The f32x8 version was faster than f32x4 even on 128b hardware. I would've needed to painstakingly port this codebase to each CPU, so it was at least a 3x reduction in lines of code (and more in programmer time).
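
        For a flavor of what that looks like, a trimmed-down sketch (not the actual graphics code) on nightly with the portable_simd feature:

          #![feature(portable_simd)]
          use std::simd::{f32x8, num::SimdFloat};

          // Scales a slice and sums it, eight lanes at a time; the compiler lowers
          // f32x8 to AVX2, or to pairs of SSE2/NEON registers on 128-bit targets.
          fn scale_and_sum(xs: &[f32], factor: f32) -> f32 {
              let factor_v = f32x8::splat(factor);
              let mut acc = f32x8::splat(0.0);
              let mut chunks = xs.chunks_exact(8);
              for chunk in &mut chunks {
                  acc += f32x8::from_slice(chunk) * factor_v;
              }
              acc.reduce_sum() + chunks.remainder().iter().map(|&x| x * factor).sum::<f32>()
          }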

      • colonial 13 hours ago
        > Many people already use non-portable SIMD for the 1-3 targets they care about, instead.

        This is something a lot of people (myself included) have gotten tripped up by. Non-portable SIMD intrinsics have been stable under std::arch for a long time. Obviously they aren't nearly as nice to hold, but if you're in a place where you need explicit SIMD speed-ups, that probably isn't a killer.

        • JoshTriplett 13 hours ago
          Exactly. Many parts of SIMD are entirely stable, for x86, ARM, WebAssembly...

          The thing that isn't stable in the standard library is the portable abstraction layer atop those. But several of those exist in the community.

      • vlovich123 15 hours ago
        > we can't fix any API issues.

        Can’t APIs be fixed between editions?

        • JoshTriplett 15 hours ago
          Partially (with upcoming support for renaming things across editions), but it's a pain if the types change (because then they're no longer common vocabulary), and all the old APIs still have to exist.
    • singron 16 hours ago
      There is a GitHub issue that details what's blocking stabilization for each feature. I've read a few recently and noticed some patterns:

      1. A high bar for quality in std

      2. Dependencies on other unstable features

      3. Known bugs

      4. Conflicts with other unstable features

      It seems anything that affects trait solving is very complicated and is more likely to have bugs or combine non-trivially with other trait-solving features.

      I think there is also some sampling bias. Tons of features get stabilized, but you are much more likely to notice a nightly feature that is unstable for a long time and complex enough to be excited about.

      • throwup238 13 hours ago
        > It seems anything that affects trait solving is very complicated and is more likely to have bugs or combine non-trivially with other trait-solving features.

        Yep and this is why many features die or linger on forever. Getting the trait solving working correctly across types and soundly across lifetimes is complicated enough to have killed several features previously (like specialization/min_specialization). It was the reason async trait took so long and why GAT were so important.

      • vlovich123 15 hours ago
        > Dependencies on other unstable features

        AFAIK that’s not a blocker for Rust - the std library is allowed to use unstable at all times.

        • estebank 15 hours ago
          I think they meant dependencies on unstable features which might yet change their semantics. A stable API relying on an unstable implementation is common in Rust (the ? operator, for example), but that is entirely dependent on having a good idea of what the eventual stable version is going to look like, in such a way that the already-stable feature won't break in any way.
    • Avi-D-coder 16 hours ago
      Usually when I go and read the GitHub and Zulip threads, the reason for paused work comes down to the fact that no one has come up with a design that maintains every existing promise the compiler has made. The most common ones I see are that the feature conflicts with safety or semver/encapsulation, interacts weirdly with object safety, causes post-monomorphization errors, or breaks perfect type class coherence (see Haskell's unsound specialization).

      Too many promises have been made.

      Rust needs more unsafe opt outs. Ironically simd has this so it does not bother me.

    • capyba 12 hours ago
      Given the “blazingly fast” branding, I too would have thought this would be in stable Rust by now.

      However, like other commenters I assume it’s because it’s hard, not all that many users of Rust really need it, and the compiler team is small and only consists of volunteers.

      • jandrewrogers 9 hours ago
        Getting maximum performance out of SIMD requires rolling your own code with intrinsics. It is something a compiler can't do for you at a pretty fundamental level.

        Most interesting performance optimizations from vector ISAs can't be done by the compiler.

        • exDM69 22 minutes ago
          > Getting maximum performance out of SIMD requires rolling your own code with intrinsics

          Not disagreeing with this statement in general, but with std::simd I can get 80% of the performance with 20% of the effort compared to intrinsics.

          For the last 20%, there's a zero cost fallback to intrinsics when you need it.

        • capyba 2 hours ago
          Interesting, how so? I’ve had really good success with the autovectorization in gcc and the Intel C compiler. Often it’s faster than my own intrinsics, though not always. One notable example, though, is that it seems to struggle with reductions - when I’m updating large arrays, i.e. `A[i] += a`, the compiler struggles to use SIMD for this and I need to do it myself.
      • queuebert 8 hours ago
        I do scientific computing, and even I rarely have a situation where CPU SIMD is a clear win. Usually it's either not worth the added complexity, or the problem is so embarrassingly parallel that you should use a GPU.
        • capyba 2 hours ago
          Interesting, in what domain? My work is in scientific computing as well (finite elements) and I usually find myself in the opposite situation: SIMD is very helpful but the added complexity of using a GPU is not worthwhile.
      • steveklabnik 12 hours ago
      Don’t forget that autovectorization does a lot too. This is only for when you want to ensure you get exactly what you want; many applications just kinda get it for free sometimes.
    • the__alchemist 16 hours ago
      Would love this. I've heard it's not planned to be in the near future. Maybe "perfect is the enemy of good enough"?
      • CooCooCaCha 16 hours ago
        Rust doesn’t have a BDFL so there’s nobody with the power to push things through when they’re good enough.

        And since Rust basically sells itself on high standards (zero-cost abstractions, etc.) the devs go back and forth until it feels like the solution is handed down from the heavens.

        • ChadNauseam 16 hours ago
          And somehow it has ended up feeling more pleasant and consistent than most languages with a BDFL, even though it was designed by committee. I don't really understand how that happened, but I appreciate the cautious and conservative approach they've taken
    • duped 13 hours ago
      std::arch::* intrinsics for SIMD are stable and you can use them today. The situation is only slightly worse than C/C++ because the Rust compiler cares a lot about undefined behavior, so there's some safe-but-technically-unsafe/annoying cfg stuff to make sure the intrinsics are actually emitted as you intend.

      There is nothing blocking high quality SIMD libraries on stable in Rust today. The bar for inclusion in std is just much higher than the rest of the ecosystem.

    • stevefan1999 10 hours ago
      As someone who used std::simd in an attempted submission to an academic conference CFP*, I have looked deeply into std::simd, and I would conclude that there are a couple of reasons it isn't stable yet (this is rather long and may need 10 minutes to read):

      1. It depends heavily on LLVM intrinsics, which themselves can change quite a lot. Sometimes an intrinsic would even fail to instantiate and crash the entire compilation. I, for example, hit chronic ICE crashes for the same code on different nightly Rust versions. Then I realized it was because the SIMD operation was too complicated and I needed to simplify it, and sometimes to stop recursing and expanding so much, to prevent stack spilling and exhausting register allocation.

      This happens from time to time, especially when using std::simd on embedded targets where registers are scarce.

      2. Some hardware design decisions make SIMD itself unergonomic and hard to generalize, and this is reflected in the design of std::simd as well.

      Recall that SIMD techniques stem from vector processors in supercomputers from the likes of Cray and IBM, that is, from the 70s; back then computation and hardware design were primitive and simple, so they had fixed vector sizes.

      That ancient design is very stable and is still kept to this day, even with the likes of AVX2, AVX512, VFP and NEON, and it influenced the design of things like lane count (https://doc.rust-lang.org/std/simd/struct.LaneCount.html).

      But plot twist: as time went on, it turned out that modern SIMD is capable of variable vector sizes; RISC-V's vector extension is one such implementation, for example.

      So now we come to a dilemma: keep the existing fixed lane count design, or allow it to extend further. If we allow it to extend further to cater to things like variable SIMD vector lengths, then we need to wait for generic_const_exprs to be stable, and right now it is not only unstable but incomplete too (https://github.com/rust-lang/portable-simd/issues/416).
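
      For reference, this is roughly how the fixed lane count surfaces in the current nightly API (a sketch using the LaneCount/SupportedLaneCount bounds linked above):

        #![feature(portable_simd)]
        use std::simd::{LaneCount, Simd, SupportedLaneCount};

        // Generic over the lane count, but N must still be one of the fixed,
        // compile-time-supported widths (1, 2, 4, ..., 64).
        fn double<const N: usize>(v: Simd<f32, N>) -> Simd<f32, N>
        where
            LaneCount<N>: SupportedLaneCount,
        {
            v + v
        }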

      This is a hard design philosophical change and is not easy to deal with. Time will tell.

      3. As an extension to #2, thinking in SIMD is hard in the first place, and to use it in production you also have to think about different situations. This comes in the form of dynamic dispatch, and it is a pain to deal with; although we have great helpers such as multiversion, it is still very hard to design an interface that scales. Take Google's highway (https://github.com/google/highway/blob/master/g3doc/quick_re...) for example: it is the library for writing portable SIMD code with dynamic dispatch in C++, but in an esoteric and not very ergonomic way. How we could do better with std::simd is still a mystery. How do you abstract the idea of a scatter-gather operation? What the heck is a swizzle? Why do we call it shuffle and not permutation? Lots of stuff to learn, and that means lots of pain to go through.

      4. Plus, when you think in SIMD, there can be multiple instructions and multiple ways to do the same thing, one possibly more efficient than another.

      For example, as I have to touch some finite field stuff in GF(2^8), there are a few ways to do finite field multiplication:

      a. Precomputed table lookup

      b. Russian Peasant Multiplication (basically carryless Karatsuba multiplication, though it often reduces to the form of table lookups as well; it can also be seen as a ripple counter with modulo arithmetic, except that the carry has to be delivered in a different way) - sketched in the code after this list

      c. Do an inner product and then do Barrett reduction (https://www.esat.kuleuven.be/cosic/publications/article-1115...)

      d. Or just treat it as multiplication over a polynomial power series, but this essentially means we treat it as a finite field convolution, which I suspect is closely related to the Fourier transform. (https://arxiv.org/pdf/1102.4772)

      e. Use the somewhat new GF2P8AFFINEQB (https://www.felixcloutier.com/x86/gf2p8affineqb) from GFNI which, contrary to what most people think, is not AVX512-only but is actually available for SSE/AVX/AVX2 as well (this is called GFNI-SSE in gcc), so it works on my 13600KF too (except that obviously I cannot use ZMM registers, or I just get an illegal instruction for anything that touches ZMM or uses the EVEX encoding). I have an internal implementation of finite field multiplication using just that, but I need to use the polynomial 0x11D rather than 0x11B, so GF2P8MULB (https://www.felixcloutier.com/x86/gf2p8mulb) is out of the question (it is supposed to be the fastest in the world theoretically, if we could use an arbitrary polynomial), but this is rather hard to understand and explain in the first place. (By the way, I used SIMDE for that: https://github.com/simd-everywhere/simde)
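
      To make option (b) concrete, here is its plain scalar form with the 0x11D polynomial mentioned in (e) (just a sketch; a SIMD version applies the same shift/XOR steps across whole vectors at once):

        /// "Russian peasant" multiply in GF(2^8), reducing modulo the polynomial 0x11D.
        fn gf_mul(mut a: u8, mut b: u8) -> u8 {
            let mut p = 0u8;
            for _ in 0..8 {
                if b & 1 != 0 {
                    p ^= a; // conditionally accumulate the current shifted multiplicand
                }
                let carry = a & 0x80 != 0;
                a <<= 1;
                if carry {
                    a ^= 0x1D; // fold the dropped x^8 term back in (low byte of 0x11D)
                }
                b >>= 1;
            }
            p
        }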

      All of these can be done in SIMD, but each of these methods has its pros and cons. Table lookup may be fast and seemingly O(1), but you actually need to keep the table in cache, meaning we trade space for time, and SIMD would amplify the cache thrashing from the multiple accesses. This could slow down the CPU pipeline, although modern CPUs are clever enough about cache management. If you want to do Russian Peasant Multiplication, then you need a bunch of loops to go through the division and XOR steps chunk by chunk.

      If you want Barrett reduction, then you need efficient carryless multiplication such as PCLMULQDQ (https://www.felixcloutier.com/x86/pclmulqdq) to do the inner product and reduce the polynomial. Or, in a more primitive way, find ways to do finite field Horner's method in SIMD...

      Thinking in SIMD is already hard, as said in #3. Balancing these trade-offs in SIMD is even harder.

      Unless you want a certain edge, or want to shatter a benchmark, I would say SIMD is not a good investment. You need to use SIMD in the right scenario at the right time. SIMD is useful, but also kind of niche, and modern CPUs are optimized well enough that the performance of general solutions without SIMD is good enough too, since everything eventually gets broken down into uops anyway, with the deep pipeline, branch predictor, superscalar execution and speculative execution doing their magic together; and most of the time, if you do want to use SIMD, the easiest SIMD methods are generally enough.

      *: I myself used std::simd intensively in my own project. Well, the submission got rejected on the grounds that the paper was severely lacking in literature study, and that I shouldn't have used an LLM so much to generate the paper.

      However, the code is here (https://github.com/stevefan1999-personal/sigmah). Now I have a new approach to this problem, derived from my current work with finite fields, error correction, divide and conquer, and polynomial multiplication, and I plan to resubmit the paper once I have time to clean it up, with a more careful approach next time too. Although, since the problem of string matching with don't-cares can be seen as a convolution, I doubt my approach will end up as anything other than that... making the paper still unworthy of acceptance.

      • janwas 3 hours ago
        > performance of general solutions without using SIMD, is good enough too, since all of which will eventually dump right down to the uops anyway, with deep pipeline, branch predictor, superscalar and speculative execution doing their magics altogether

        A quick comment on this one point (personal opinion): from a hyperscaler perspective, scalar code is most certainly not enough. The energy cost of scheduling a MUL instruction is something like 10x that of the actual operation it performs. It is important to amortize that cost over many elements (i.e. SIMD).

      • eden-u4 4 hours ago
        wow, thanks for this long explanation.
    • IshKebab 16 hours ago
      I would love generators too but I think the more features they add the more interactions with existing features they have to deal with, so it's not surprising that its slowing down.
      • estebank 15 hours ago
        Generators in particular has been blocked on the AsyncIterator trait. There are also open questions around consuming those (`for await i in stream`, or just keep to `while let Some(i) in stream.next().await`? What about parallel iteration? What about pinning obligations? Do that as part of desugaring or making it explicit?). It is a shame because it is almost orthogonal, but any given decision might not be compatible with different approaches for generators. The good news is that some people are working on it again.
  • the__alchemist 16 hours ago
    Of interest, I've written my own core::simd mimic so I don't have to make all my libs and programs use nightly. It started as me just making my Quaternion and Vec lib (lin-alg) have their own SoA SIMD variants (Vec3x16 etc), but I ended up implementing and publicly exposing f32x16 etc. Will remove those once core::simd is stable. Downside: These are x86 only; no ARM support.

    I also added packing and unpacking helpers that assist with handling final lane 0 values etc. But there is still some subtlety, as the article pointed out, compared to using Rayon or non-SIMD CPU code, related to packing and unpacking. E.g. you should try to keep things in their SIMD form throughout the whole pipeline, and think about how you pair them with non-SIMD values (like you might pair [T; 8] with f32x8), etc.

    • ____tom____ 16 hours ago
      I'm not a rust programmer.

      Can't you just make a local copy of the existing package and use that? Did you need to re-implement?

      • dzaima 15 hours ago
        The nightly built-in core::simd makes use of a bunch of intrinsics to "implement" the SIMD ops (or, rather, to directly delegate the implementation to LLVM, which you otherwise cannot do from plain Rust), and those are as volatile as core::simd itself, if not more so (and also nightly-only).
        • vlovich123 15 hours ago
          > or, rather, directly delegate the implementation to LLVM which you otherwise cannot do from plain Rust

          I thought the intrinsics specifically were available in plain safe Rust and the alignment-requiring intrinsics were allowed in unsafe Rust. I’m not sure I understand this “direct to LLVM dispatch” argument or how that isn’t accessible to stable Rust today.

          • dzaima 15 hours ago
            You can indeed use intrinsics to make a SIMD library in plain safe stable Rust today, to some extent; that just isn't what core::simd does. Rather, on the Rust side it's all target-agnostic, and LLVM (or whatever other backend) handles deciding how to lower any given op to the target architecture.

            e.g. all core::simd addition ends up invoking the single function [1] which is then directly handled by rustc. But these architecture-agnostic intrinsics are unstable[2] (as they're only there as a building block for core::simd), and you can't manually use "#[rustc_intrinsic]" & co in stable rust either.

            [1]: https://github.com/rust-lang/rust/blob/b01cc1cf01ed12adb2595...

            [2]: https://github.com/rust-lang/rust/blob/b01cc1cf01ed12adb2595...

            • the__alchemist 14 hours ago
              This is what I ended up doing as a stopgap.
      • the__alchemist 16 hours ago
        Good question. Probably, but I don't know how and haven't tried.
  • jtrueb 16 hours ago
    SIMD was one feature I thought we needed. Then I started benchmarking using iter with chunks and a nested if statement to check the chunk size. If it was necessary to do more, it was typically time to drop down to asm rather than worry about another layer in between the code and the machine.
    • b33j0r 14 hours ago
      This is the most surprising comment to me. It’s that bad? I haven’t benchmarked it myself.

      Zig has @Vector. This is a builtin, so it gets resolved at comptime. Is the problem with Rust here too much abstraction?

      • oasisaimlessly 14 hours ago
        I think you misinterpreted GP; he's saying that with some hints (explicit chunking with a branch on the chunk size), the compiler's auto-vectorization can handle the rest, inferring SIMD instructions in a manner that's 'good enough'.
  • mdriley 17 hours ago
    > TL;DR: use std::simd if you don’t mind nightly, wide if you don’t need multiversioning, and otherwise pulp or macerator.

    This matches the conclusion we reached for Chromium. We were okay with nightly, so we're using `std::simd` but trying to avoid the least stable APIs. More details: https://docs.google.com/document/d/1lh9x43gtqXFh5bP1LeYevWj0...

    • vlovich123 15 hours ago
      Do you compile the whole project with nightly or just specific components?
  • taeric 15 hours ago
    I'm curious about the uptake of SIMD and other assembly-level usage through high-level code. I'd assume most is done either by people writing very low level code that directly manages the data, or by using very high level libraries that are prescriptive about what data they work with?

    How many people are writing somewhat bog-standard Rust/C and expecting optimal assembly to be created?

    • jacquesm 10 hours ago
      I was heavily into assembly before I discovered C. For the first decade and a half or so I could usually beat the compiler. Since then, especially when supporting multiple architectures, I have not been able to do that unless I knew some assumption the compiler was likely to make wasn't true. The 'const' keyword alone killed most of my hand-optimized stuff.

      In the end the only bits where I resorted to assembly were the ones where it wouldn't make any sense to write stuff in C. Bootloaders, for instance, when all you have to work with is 512 bytes the space/speed constraints are much more on the space side and that's where I find I still have a (slight) edge. Which I guess means that 'optimal' is context dependent and that the typical 'optimal' defaults to 'speed'.

      • taeric 8 hours ago
        I think this is talking past my question? I don't necessarily think "low level" has to be "writing assembly." I do think it means, "knows the full layout of the data in memory." Something a surprising number of developers do not know.

        I've literally had debates with people that thought a CSV file would be smaller than holding the same data in memory. Senior level developers at both startups and established companies. My hunch is they had only ever done this using object oriented modeling of the data. Worse, usually in something like python, where everything is default boxed to hell and back.

        • jacquesm 8 hours ago
          I was really only responding to this part, apologies for not quoting it:

          > How many people are writing somewhat bog standard RUST/C and expect optimal assembly to be created?

          As for:

          > I don't necessarily think "low level" has to be "writing assembly." I do think it means, "knows the full layout of the data in memory." Something a surprising number of developers do not know.

          Agreed. But then again, there are ton of things a surprising number of developers don't know, this is just another one of those.

          Similar stuff:

          - computers always work

          - the CPU is executing my instructions one after the other in the same order in which they are written on the line(s)

          - the CPU is executing my instructions only one at the time

          - if I have a switch (or whatever that construct is called in $language) I don't need to check for values I do not expect because that will never happen

          - the data I just read in is as I expect it to be

          You can probably extend that list forever.

          Your CSV example is an interesting one, I can think of cases where both could be true, depending on the kind of character encoding used and the way the language would deal with such a character encoding. For instance in a language where upon reading that file all of the data would be turned into UTF-16 then, indeed, the in memory representation of a plain ASCII CSV file could well be larger than the input file. Conversely, if the file contained newlines and carriage returns then the in-memory representation could omit the CRs and then the in memory representation would be smaller. If you turn the whole thing into a data structure then it could be larger, or smaller, depending on how clever the data structure was and whether or not the representation would efficiently encode the values in the CSV.

          > My hunch is they had only ever done this using object oriented modeling of the data.

          Yes, that would be my guess as well.

          > Worse, usually in something like python, where everything is default boxed to hell and back.

          And you often have multiple representations of the same data because not every library uses the same conventions.

    • zamadatix 14 hours ago
      It's really only comparable to assembly level usage in the SIMD intrinsics style cases. Portable SIMD, like std::simd, is no more assembly level usage than calling math functions from the standard library.

      Usually one only bothers with the intrinsic level stuff for the use cases you're saying. E.g. video encoders/decoders needing hyper-optimized, per architecture loops for the heavy lifting where relying on the high level SIMD abstractions can leave cycles on the table over directly targeting specific architectures. If you're just processing a lot of data in bulk with no real time requirements, high level portable SIMD is usually more than good enough.

      • taeric 14 hours ago
        My understanding was that the difficulty with the intrinsics was more in how restrictive they are in what data they take in. That is, if you are trying to be very controlling of the SIMD instructions getting used, you have backed yourself into caring about the data that the CPU directly understands.

        To that end, even "calling math functions" is something that a surprising number of developers don't do. Certainly not with the standard high level data types that people often try to write their software into. No?

        • zamadatix 12 hours ago
          More than that: many of the intrinsics can be unsafe in standard Rust. This situation got much better this year but it's still not perfect. Portable SIMD has always been safe, because they are just normal high level interfaces. The other half is intrinsics are specific to the arch. Not only do you need to make sure the CPUs support the type of operation you want to do, but you need to redo all of the work to e.g. compile to ARM for newer MacBooks (even if they support similar operations). This is also not a problem using portable SIMD, the compiler will figure out how to map the lanes to each target architecture. The compiler will even take portable SIMD and compile it for a scalar target for you, so you don't have to maintain a SIMD vs non-SIMD path.

          By "calling math functions" I mean things like:

            let x = 5.0f64;
            let result = x.sqrt();
          
          Where most CPUs have a sqrt instruction but the program will automatically compile with a (good) software substitution for targets that don't. It's very similar with portable SIMD - the high level call gets mapped to whatever the target best supports automatically. Neither SIMD nor these kind of math functions work automatically with custom high level data types. The only way to play for those is to write the object to have custom methods which break it down to the basic types so the compiler knows what you want the complex type's behavior to be. If you can't code that then there isn't much you can do with the object, regardless of SIMD. With intrinsics you need to go a step further beyond all that and directly tell the compiler what specific CPU instructions should be used for each step (and make sure that is done safely, for the remaining unsafe operations).
          • taeric 8 hours ago
            I knew what you meant. My point was more that most people are writing software at the level of "if (overlaps(a, b)) doSomething()" Yes, there will be plenty of math and intrinsics in the "overlaps" after you get through all of the accessors necessary to have the raw numbers. But especially in heavily modeled spaces, the number one killer of getting to the SIMD is that the data just isn't in a friendly layout for it.

            Is that not the case?

  • waffletower 15 hours ago
    I am torn -- while I love the bitter critique of std::simd's nightly builds (why bother with any public release if it is never stable?), I cringed at the critique of "(c)urrently things are well fleshed out for i32, i64, f32, and f64 types". f64 and i64 go a long way for most numerical applications -- the OP seemed snowflaky to me with that entitled concern.
    • zamadatix 15 hours ago
      Not supporting other types (particularly smaller ones) can be quite limiting on portable SIMD, especially when it doesn't support AVX512 either, but those are certainly a good core group - just not the whole story. Regardless, I'm not sure how painting the OP as an entitled snowflake helps anything over just asking the question.
  • justahuman74 15 hours ago
    Somewhat related: does Rust handle the RISC-V vector extension in a similar way to SIMD?
    • dzaima 13 hours ago
      Scalable vectors as in RVV & SVE aren't available in Rust currently; see https://github.com/rust-lang/rust/issues/145052

      (That said, autovectorization should work, and fixed-width SIMD should map to RVV as well as possible, though of course missing out on performance if run on wider-than-minimum hardware without a native build.)

  • brundolf 8 hours ago
    std::simd is so nice and easy to use, even as someone who's never done SIMD before. I wonder why it's stuck as nightly-only
  • IshKebab 16 hours ago
    > Fortunately, this problem only exists on x86.

    Also RISC-V, where you can't even probe for extension support in user space unfortunately.

    • dzaima 16 hours ago
      Linux of course does have an interface for RISC-V extension probing via hwprobe. And there's a C interface[1] for probing that's OS-agnostic (though it's rather new).

      [1]: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/s...

    • raphlinus 16 hours ago
      It's not strictly x86 either; the other case you care about is fp16 support on ARM. But that is included in the M1 target, so it really only matters on other ARM.
  • CashWasabi 14 hours ago
    I really dislike those articles that are language focused. Why not try to share them in a way that is language agnostic?
    • capyba 12 hours ago
      This article is specifically about the implementation of SIMD in Rust, not other languages.