Reflections on 30 years of HPC programming

(chapel-lang.org)

141 points | by matt_d 4 days ago

20 comments

  • jandrewrogers 1 day ago
    I can easily explain this, having worked in this space. The new languages don’t actually solve any urgent problems.

    How people imagine scalable parallelism works and how it actually works doesn’t have a lot of overlap. The code is often boringly single-threaded because that is optimal for performance.

    The single biggest resource limit in most HPC code is memory bandwidth. If you are not addressing this then you are not addressing a real problem for most applications. For better or worse, C++ is really good at optimizing for memory bandwidth. Most of the suggested alternative languages are not.

    It is that simple. The new languages address irrelevant problems. It is really difficult to design a language that is more friendly to memory bandwidth than C++. And that is the resource you desperately need to optimize for in most cases.
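
    To make "friendly to memory bandwidth" concrete, here is a minimal C++ sketch (types and names are mine, purely illustrative): structure-of-arrays data streamed linearly, which is the shape most bandwidth-bound kernels want.

        #include <cstddef>
        #include <vector>

        // Structure-of-arrays layout: each field is contiguous, so this pass
        // touches every cache line exactly once and the hardware prefetcher
        // can keep the memory pipeline full. An array-of-structs layout (or
        // pointer chasing) would drag unused bytes through the cache instead.
        struct Particles {
            std::vector<float> x, vx;
        };

        void integrate(Particles& p, float dt) {
            const std::size_t n = p.x.size();
            for (std::size_t i = 0; i < n; ++i)
                p.x[i] += p.vx[i] * dt;  // pure linear reads and writes
        }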

    • bruce343434 23 hours ago
      What does it mean to be friendly to memory bandwidth, and why does C++ excel at it, over, say, Fortran or C or Rust?
      • bayindirh 17 hours ago
        Actually, C, FORTRAN and C++ are all friendly to memory bandwidth, when written correctly.

        C++ is better than FORTRAN because, while FORTRAN is still being developed and is quite fast, doing anything beyond what core FORTRAN is good at is hard. At the end of the day, it computes and works well with MPI. That's mostly all.

        C++ is better than C because it can accommodate C code inside, has many more convenience functions and libraries around it, and modern C++ can be written more concisely than C, with minimal or no added overhead.

        Also, all three languages are studied so well that advanced programmers can look at a piece of code and say "I can fit that into the cache, that'll work, that's fine".

        "More modern" programming languages really solve no urgent problems in the HPC space, and the current code works quite well there.

        Reported from another HPC datacenter somewhere in the universe.

        • nine_k 9 hours ago
          I suppose that most HPC problems are embarrassingly parallel™, and have very little if any mutable shared state?
          • bradcray 9 hours ago
            I'd say that the opposite is more often the reality, which is why HPC systems tend to have high-bandwidth, low-latency networks.
            • nine_k 3 hours ago
              High bandwidth may mean the need to consult some very large but immutable data structure. As a trivial example, multiplying two matrices requires accessing each matrix fully multiple times over, but neither of them is altered in the process, so it can safely be done in parallel. Recording the result of a (naive) matrix multiplication can also be done without programmatic coordination, because each element is only updated once, independently from others.
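
              A minimal sketch of that last point (illustrative; assumes OpenMP, compiled with -fopenmp): every element of the result is owned by exactly one iteration, so no locks or atomics are needed.

                  #include <cstddef>
                  #include <vector>

                  // A and B are read-only shared state; each element of C is written
                  // exactly once by exactly one thread, so the rows can be computed
                  // in parallel with no coordination at all.
                  std::vector<double> matmul(const std::vector<double>& A,
                                             const std::vector<double>& B, int n) {
                      std::vector<double> C(static_cast<std::size_t>(n) * n, 0.0);
                      #pragma omp parallel for
                      for (int i = 0; i < n; ++i)
                          for (int k = 0; k < n; ++k)   // k in the middle streams B linearly
                              for (int j = 0; j < n; ++j)
                                  C[i * n + j] += A[i * n + k] * B[k * n + j];
                      return C;
                  }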

              This is very unlike, say, a database engine, where mutations occur all the time and may come from multiple threads.

              Rust specifically makes it hard to impossible to clobber shared mutable state, e.g. to produce a dangling pointer. But this is not a problem that our matrix-multiplication example would have, so it won't benefit from being implemented in Rust. Maybe this applies to more classes of HPC problems.

              • godelski 2 hours ago
                The HPC infrastructure is not like what you're used to. It is very high bandwidth, but latency depends on where your data lives. There are a lot more layers that complicate things, and each layer has a very different I/O speed.

                https://extremecomputingtraining.anl.gov/sites/atpesc/files/...

                Also, how you handle the data can be very different. Just see how libraries like this one work. They take advantage of those burst buffers and try to minimize what's being pulled from storage. There's a lot of memory management, though, in the code people write to do all the complex stuff you need so that you aren't waiting around for disks... or worse... tape

                https://adios-io.org/applications/

          • stonogo 9 hours ago
            On the contrary. However, they tend to manually manage memory rather than outsourcing it to a language runtime or a distributed key-value store.
      • grg0 10 hours ago
        I'd say it's being able to structure your data however suits your problem and your hardware, then being able to look at a profile and map reads/writes back to source. Both C and C++ excel at this.

        The advantage of C++ over C is that, with care, you can write zero-cost abstractions over whatever mess your data ends up as, and make the API still look intuitive. C isn't as good here.
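
        As a hypothetical example of what I mean (names are mine), a thin 2D view over a flat buffer: the data stays in whatever packed layout the problem wants, call sites read naturally, and it all inlines down to raw pointer arithmetic.

            #include <cstddef>
            #include <vector>

            // Non-owning 2D view over a flat, densely packed buffer. operator()
            // inlines away; the generated code is identical to hand-written
            // data[r * cols + c] indexing.
            template <typename T>
            class View2D {
                T* data_;
                std::size_t cols_;
            public:
                View2D(T* data, std::size_t cols) : data_(data), cols_(cols) {}
                T& operator()(std::size_t r, std::size_t c) {
                    return data_[r * cols_ + c];
                }
            };

            void scale(std::vector<float>& buf, std::size_t rows,
                       std::size_t cols, float s) {
                View2D<float> m(buf.data(), cols);
                for (std::size_t r = 0; r < rows; ++r)
                    for (std::size_t c = 0; c < cols; ++c)
                        m(r, c) *= s;
            }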

        • uecker 43 minutes ago
          In my experience, the "zero-cost abstractions" from C++ most of the time make the code more annoying to maintain and/or understand, especially with respect to resource management; they introduce compatibility issues at the toolchain level; and, even when they look perfect in toy benchmarks, they are often not even zero-cost (e.g. all the bloat the templates generate often hurts).
        • nine_k 10 hours ago
          Is Fortran 90 not flexible enough in defining data layout?
      • lugu 21 hours ago
        The parent talks about new languages; as per the article, Fortran and C are doing fine. I speculate the benefit of C++ over Rust is how it lets programmers give the compiler guarantees that go beyond the initial semantics of the language. See __restrict, __builtin_prefetch and __builtin_assume_aligned. The programming language is a space for conversation between compiler builders and hardware designers.
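
        For instance, an illustrative sketch of those hints in use (GCC/Clang extensions; the function and names are made up):

            #include <cstddef>

            // __restrict promises the compiler that x and y never alias,
            // __builtin_assume_aligned licenses aligned vector loads, and
            // __builtin_prefetch warms a cache line ahead of its use.
            void axpy(float* __restrict y, const float* __restrict x,
                      float a, std::size_t n) {
                x = static_cast<const float*>(__builtin_assume_aligned(x, 64));
                y = static_cast<float*>(__builtin_assume_aligned(y, 64));
                for (std::size_t i = 0; i < n; ++i) {
                    if (i + 64 < n)
                        __builtin_prefetch(x + i + 64);  // hint: fetch a line ahead
                    y[i] += a * x[i];
                }
            }
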
        • ozgrakkurt 17 hours ago
          It is just super unpleasant to write low-level software in Rust.

          There is a colossal ergonomics difference if you compare using C (clang) vs Rust to write a hashmap, for example.

          C compilers just have everything you can think of, because everything is first implemented there.

          Using anything else just seems kind of pointless. I understand new languages do have benefits, but I don't believe language matters that much really.

          The person who writes that garbage pointer soup in C writes Arc<> + multi-threaded + macro garbage soup in Rust.

        • _flux 20 hours ago
          I believe __restrict, and __builtin_prefetch/__builtin_assume are compiler extensions, not part of the C++ language as is, and different compilers implement (or don't) these differently.

          The Rust compiler actually has similar things, but they're not available in stable builds. I suppose there are some issues of principle for why they're not included in stable. E.g.: https://doc.rust-lang.org/std/intrinsics/fn.prefetch_read_da...

          Maybe some time in the future good, acceptable abstractions will be conceived for them. Perhaps just using nightly builds for HPC is not that far out, though.

          • ameliaquining 17 hours ago
            Rust already has __restrict; it is spelled &mut and is one of the most fundamental parts of the language. The key difference, of course, is that it's checked by the compiler, so is useful for correctness and not just performance. Also, for a long time it wasn't used for optimization, because the corresponding LLVM feature (noalias) was full of miscompilation bugs, because not that much attention was being paid to it, because hardly anyone actually uses restrict in C or __restrict in C++. But those days are finally over.

            __builtin_assume is available on stable (though of course it's unsafe): https://doc.rust-lang.org/std/hint/fn.assert_unchecked.html

            There's an open issue to stabilize the prefetch APIs (https://github.com/rust-lang/rust/issues/146941). As is usually the case when a minor standard-library feature remains unstable, the primary reason is that nobody has found the problem urgent enough to put in the required work to stabilize it. (There's an argument that this process is currently too inefficient, but that's a separate issue.) In the meantime, there are third-party libraries available that use inline assembly to offer this functionality, though this means they only support a couple of the most popular architectures.

          • m_mueller 17 hours ago
            By the way, Fortran implicitly behaves as if "restrict" applied by default, which makes sense together with its intuitive "intent" system for function/subroutine arguments. This is one of the biggest reasons why it's still so popular in HPC: scientists can pretty much just write down their equations, follow a few simple rules (e.g. on storage order), and out comes fairly performant machine code. Doing the same (a 'naive' first implementation) in C or C++ usually leads to something severely degraded compared to the theoretical limits of a given algorithm on given hardware.
            • _flux 17 hours ago
              Oh I actually had some editing mistake, I meant to say that also Rust has restrict by default, by virtue of all references being unique xor readonly.

              As I understand it, the Fortran compiler just expects your code to respect the "restrictness", it doesn't enforce it.

          • moregrist 16 hours ago
            restrict is in C99. I’m not sure why standard C++ never adopted it, but I can guess: it can be hard to reason about two restrict’d pointers in C, and it probably becomes impossible when it interacts with other C++ features.

            The rest are compiler extensions, but if you’re in the space you quickly learn that portability is valued far less than program optimization. Most of the point of your large calculations is the actual results themselves, not the code that got you there. The code needs to be correct and reproducible, but HPC folks (and grant funding agencies) don’t care if your Linux/amd64 program will run, unported, on Windows or on arm64. Or whether you’ve spent time making your kernels work with both rocm and cuda.

    • iamcreasy 12 hours ago
      The Julia language is also used for HPC, according to its webpage, which cites performance parity with C++. Would it be correct to infer that Julia also provides the same level of memory bandwidth control?
      • Fronzie 12 hours ago
        It's close, but not quite. Getting it to do 'y = ax + by'-style vector operations without superfluous reads/writes is tricky.
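
        For reference, the single fused pass one wants for y = ax + by looks like this in C++ (my sketch, illustrative only); the hard part in a high-level array language is ensuring no temporary for a*x gets materialized, which would roughly double the memory traffic:

            #include <cstddef>

            // One fused pass: x and y are each read once, y written once.
            // A materialized temporary for a*x would add a full extra
            // write-then-read round trip through memory.
            void axpby(float* y, const float* x, float a, float b, std::size_t n) {
                for (std::size_t i = 0; i < n; ++i)
                    y[i] = a * x[i] + b * y[i];
            }

        (In Julia, the broadcast form @. y = a*x + b*y is supposed to fuse into exactly this loop; verifying that no temporaries sneak in is the tricky part.)
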
    • j4k0bfr 23 hours ago
      I'm pretty interested in realtime computing and didn't realise C++ was considered bandwidth efficient! Coming from C, I find myself avoiding most 'new' C++ features because I can't easily figure out how they allocate without grabbing a memory profiler.
      • bayindirh 17 hours ago
        You can always go through cachegrind or perf and see what happens with your code.

        I managed to reach the practical IPC limits of the hardware I was running on, and while I could theoretically have made the prefetcher happier with some matrix reordering, looking back, I'm not sure how much performance it would have provided, since the FPU was already saturated at that point.

      • GuB-42 13 hours ago
        C++ is like C with extra features, but you don't need to use them.

        If you want control over your memory, you can do pointers the C way, but you still have features like templates, namespaces, etc... Another advantage of C++ is that it can go both high and low level within the same language.
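
        For example, a trivial sketch of mixing those two levels (a made-up kernel): memory is managed the C way through raw pointers, with a template on top so one definition serves every element type at no runtime cost.

            #include <cstddef>

            // Raw-pointer kernel, C-style memory control; the template is the
            // only C++ feature used, and it costs nothing at runtime.
            template <typename T>
            T dot(const T* a, const T* b, std::size_t n) {
                T acc = T(0);
                for (std::size_t i = 0; i < n; ++i)
                    acc += a[i] * b[i];
                return acc;
            }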

        The disadvantages of C++ are mostly related to portability and interop. Things like name mangling, constructors, etc. can be a problem. Also, C++ officially doesn't support some C features like "restrict"; in practice you often can use them, but it is nonstandard. Probably not a concern for HPC.

        • bch 12 hours ago
          > C++ is like C with extra features, but you don't need to use them

          C++ certainly (literally (Cfront[0])) used to be this, but I thought modern (decade or more) conventional wisdom is to NOT think like this anymore. Curious to hear others weigh in.

          [0] https://en.wikipedia.org/wiki/Cfront

          • GuB-42 11 hours ago
            To me, it is not "conventional wisdom"; it is the position of a vocal group of C++ guys who look at Rust and its memory safety and don't want to be left out.

            Their way is not wrong, new constructs are indeed safer, more powerful, etc... But if you are only in for the new stuff, why use C++ at all, you are probably better off with Rust or something more modern. The strength of C++ is that it can do everything, including C, there is no "right" way to use it. If you need raw pointers, use raw pointers, if you need the fancy constructs the STL provides, use them, these are all supported features of the language, don't let someone else who may be working in a completely different field tell you that you shouldn't use them.

            • ablob 9 hours ago
              C++ by comparison doesn't stand in your way too much either. I feel like the biggest gripe with Rust is what happens when you do have to go unsafe. That seems to be a strong point of contention for many folks. Maybe all the reasons that lead people to use unsafe Rust will go away, or the attitude about it will shift in some manner.

              For me, Rust turned out to be less interesting after I saw the whole ceremony around typing. The amount of things I had to grasp just to get a glimpse into what a library does felt much more involved than anything I did with C++. The whole annotation thing feels much less necessary in C++, and more like a proper opt-in.

      • grg0 10 hours ago
        C++ comes with baggage and requires up-front training. You need to dive into every language feature and STL library, learn how compilers implement stuff, then decide what to use and what not to, and the decision often depends on context. It has a high cognitive load in my opinion for that reason. But once you do that, you get a relatively high-level language that can go as low and be as fast as C.
      • Narishma 22 hours ago
        I don't think there's much difference between C and C++ (and Rust, etc...) when it comes to this.
        • formerly_proven 20 hours ago
          Idiomatic/natural rust tends to be a lot heavier on allocations and also physically moving objects around than the other two.
          • kmaitreys 19 hours ago
            Can you elaborate on this? Slightly concerned because I have written (and planning to write more) Rust HPC code
            • Joeboy 19 hours ago
              Maybe not what they meant, but Rust sometimes makes it tempting to just copy things rather than fighting the borrow checker. Whereas in C++ you're free to just pass pointers around and not worry about it until / unless your code crashes or gets exploited.

              Speaking authoritatively from my position as an incompetent C++ / Rust dev.

              • kmaitreys 17 hours ago
                I see. Fortunately, I'm aware of that, and I don't use clone (unless I intend to) all that much. The borrow checker is usually not a problem when writing scientific/HPC code.

                Because passing pointers isn't as ergonomic in Rust, I do things in an arena-based way (for example when setting up quadtrees or octrees). Is that part of the issue when it comes to memory bandwidth?

            • zozbot234 18 hours ago
              Stable Rust doesn't have a local allocator construct yet, you can only change the global allocator or use a separate crate to provide a local equivalent.
              • kmaitreys 17 hours ago
                Right. I have seen Zig where one needs to specify allocators as well. I'm sorry I'm not well versed enough to know how it makes things better for HPC though?

                For now my plan is to write fairly similar style code as one may write in C++/Fortran through MPI bindings in Rust.

                • convolvatron 17 hours ago
                    if you're using thread-level parallelism, there is always a benefit to having a per-thread allocator so that you don't have to take global locks to get memory; those locks become highly contended.

                  if you take that one step further and only use those objects on a single core, now your default model is lock-free non-shared objects. at large scale that becomes kind of mandatory. some large shared memory machines even forgo cache consistency because you really can't do it effectively at large scale anyways.
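
                    a minimal sketch of the per-thread allocator idea (hypothetical code; assumes power-of-two alignment): each thread bumps a pointer in its own arena, so allocation never touches a global lock.

                        #include <cstddef>
                        #include <vector>

                        // bump allocator: an allocation is an add and a compare.
                        // one instance per thread means zero lock contention.
                        class Arena {
                            std::vector<std::byte> buf_;
                            std::size_t used_ = 0;
                        public:
                            explicit Arena(std::size_t bytes) : buf_(bytes) {}
                            void* alloc(std::size_t n,
                                        std::size_t align = alignof(std::max_align_t)) {
                                std::size_t p = (used_ + align - 1) & ~(align - 1);
                                if (p + n > buf_.size()) return nullptr;  // arena exhausted
                                used_ = p + n;
                                return buf_.data() + p;
                            }
                        };

                        thread_local Arena tls_arena(1 << 20);  // 1 MiB per thread, never shared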

                  but all of this is highly platform dependent, and I wouldn't get too wrapped up around it to begin with. I would encourage you though to worry first about expressing your domain semantics, with the understanding that some refactoring for performance will likely be necessary.

                    if you have the patience, personally and within the project, it can be a lot of fun to really get in there and think about the necessary dependencies and how they can be expressed on the hardware. there's a lot of cool tricks, for example trading off redundant computation to reduce the frequency of communication.

                  • kmaitreys 16 hours ago
                    Thank you for such a great reply!

                    There's a lot of useful advice here that'll surely come in handy to me later. For now, yeah I'm just going to try to make things work. So far I have mostly written intra-node code for which rayon has been adequate. I haven't gotten around to test the ergonomics of rs-mpi. But it feels quite an exciting prospect for sure.

        • Joel_Mckay 21 hours ago
          There is, unless you're using an LLVM compiler that does naive things with code motion.

          Rust is typically slowest (often negligible <3%), C++ has better CUDA support, and C can be heavily optimized with inline assembly (very unforgiving to juniors.)

          Also, heavily associated with coding style =3

          https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...

      • Joel_Mckay 23 hours ago
        • j4k0bfr 18 hours ago
          I'm talking almost exactly about this haha. Flight software representation!! Although I have no experience programming FPGAs, I hope to gain some soon. They seem like the ultimate solution to my IO woes.
          • Joel_Mckay 18 hours ago
            A bit legacy these days, but I liked the old Zynq 7020 dual core ARM with reasonable LUT counts.

            https://www.youtube.com/watch?v=FujoiUMhRdQ

            https://github.com/Spyros-2501/Z-turn-Board-V2-Diary

            https://www.youtube.com/@TheDevelopmentChannel/playlists

            https://myirtech.com/list.asp?id=708

            The Debian Linux example includes how to interface hardware ports.

            Almost always better to go with a simpler mcu if one can get away with it. Best of luck =3

            • jeffreygoesto 13 hours ago
              That chip was hitting a sweet spot in terms of DRAM controller and distributing memory bandwidth between CPU cores and fabric. Xilinx was very afraid of screwing this up and running into bottlenecks. One of the best balanced chips in that regard with a great controller. Your best bet still was to keep everything in blockram as much as possible and only read and write DRAM once at every end of the computation...
            • j4k0bfr 18 hours ago
              No way, was looking at the Z-7045 for a work project literally today. And yep I agree, simpler solutions have simpler problems lol. Thanks for the recommendation, I'll give it a look!
    • Joel_Mckay 23 hours ago
      > C++ is really good at optimizing for memory bandwidth

        In general, most modern CPU thread-safe code is still a bodge in most languages. If folks are unfortunate enough to encounter inseparable overlapping-state sub-problems, then there is no magic pixie dust to escape the computational cost. On average, attempting to parallelize this type of code can end up >30% slower on identical hardware, and a GPU memory-copy exchange can make it even worse.

      Sometimes even compared to a large multi-core CPU, a pinned-core higher clock-speed chip will win out for those types of problems.

        Thus the mystery of why most people revert to batching k copies of a single-core-bound, non-parallel version of a program: it reduces latency, stalls, cache thrashing, I/O saturation, and interprocess communication costs.

        Exchange costs only balloon higher across networks. However fast the cluster partition claims to be, physics is still going to impose space-time constraints, and modern data centers will spend >15% of their energy cost just moving stuff around networks for lower-efficiency code.

      I like languages like Julia, as it implicitly abstracts the broadcast operator to handle which areas may be cleanly unrolled. However, much like Erlang/Elixir the multi-host parallelization is not cleanly implemented... yet...

        The core problem with HPC software has always been that academics are best modeled like hermit crabs with facilities. Once a lucky individual inherits a nice new shell, the pincers come out at all smaller entities who may approach with competing interests.

      Best of luck, =3

      "Crabs Trade Shells in the Strangest Way | BBC Earth"

      https://www.youtube.com/watch?v=f1dnocPQXDQ

    • convolvatron 15 hours ago
      I worked in parallel computing in the late 80s and early 90s, when parallel languages were really a thing. in HPC applications memory bandwidth is certainly a concern, although usually the global communication bandwidth (assuming they are different) is the roofline. by saying c++ you're implying that MPI is really sufficient, and while it's certainly possible to prop up parallel codes with MPI, it is really quite tiresome, and it's hard to play with the really interesting problem, which is the mapping of the domain state across the entire machine.

      other hugely important problems that c++ doesn't address are latency hiding, which avoids stalling out your entire core waiting for a distributed message, and the related issue of interleaving computation and communication.
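
      the canonical overlap pattern looks roughly like this (schematic sketch; the kernels and neighbor ranks are made up): post the halo exchange, compute everything that doesn't depend on it, then wait.

          #include <mpi.h>
          #include <vector>

          // hypothetical user kernels, assumed to exist elsewhere
          void compute_interior(std::vector<double>& interior);
          void compute_boundary(std::vector<double>& interior,
                                const std::vector<double>& halo);

          // latency hiding: the interior computation runs while the network
          // moves the halo, instead of the core stalling on the message.
          void step(std::vector<double>& interior, std::vector<double>& halo,
                    int left, int right) {
              MPI_Request reqs[2];
              MPI_Irecv(halo.data(), static_cast<int>(halo.size()), MPI_DOUBLE,
                        left, 0, MPI_COMM_WORLD, &reqs[0]);
              MPI_Isend(interior.data(), static_cast<int>(halo.size()), MPI_DOUBLE,
                        right, 0, MPI_COMM_WORLD, &reqs[1]);
              compute_interior(interior);                // overlaps the communication
              MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
              compute_boundary(interior, halo);          // the halo has now arrived
          }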

      another related problem is that a lot of the very interesting hardware that might exist to do things like RDMA or in-network collective operations or even memory-controller based rich atomics, aren't part of the compiler's view and thus are usually library implementations or really hacky inlines.

      is there a good turnkey parallel language? no. is there sufficient commonality in architecture, or even a lot of investment in interesting ideas that were abandoned because of cost? no. but there remains a huge potential to exploit parallel hardware with implicit abstractions, and I think saying 'just use c++' is really missing almost all of the picture here.

      addendum: even if you are working on a single-die multicore machine, if you don't account for locality, it doesn't matter how good your code generator is; you will saturate the memory network. so locality is important, and languages like Chapel are explicitly trying to provide useful abstractions for you to manage it.

    • suuuuuuuu 21 hours ago
      If you think C++ is the best here, then I don't think you've actually worked in this space nor appreciated the actual problems these languages try to solve. In particular because you can't program accelerators with C++.

      Memory bandwidth is often the problem, yes. Language abstractions for performance aim to, e.g., automatically manage caches (that must be handled manually in performant GPU code, for instance) with optimized memory tiling and other strategies. Kernel fusion is another nontrivial example that improves effective bandwidth.
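
      Kernel fusion in miniature (an illustrative CPU sketch of the idea; on a GPU the same transformation applies to launched kernels):

          #include <cstddef>

          // Unfused: two kernels, so y makes two round trips through DRAM.
          void unfused(float* y, const float* x, std::size_t n) {
              for (std::size_t i = 0; i < n; ++i) y[i] = 2.0f * x[i];   // kernel 1
              for (std::size_t i = 0; i < n; ++i) y[i] = y[i] + 1.0f;   // kernel 2
          }

          // Fused: one pass, roughly half the memory traffic -- the kind of
          // transformation a higher-level abstraction can apply automatically.
          void fused(float* y, const float* x, std::size_t n) {
              for (std::size_t i = 0; i < n; ++i) y[i] = 2.0f * x[i] + 1.0f;
          }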

      Adding on the diversity of hardware that one needs to target (both within and among vendors), i.e., portability not just of function but of performance, makes the need for better tooling abundantly obvious. C++ isn't even an entrant in this space.

      • pjmlp 20 hours ago
        What?!?

        NVidia designs CUDA hardware specifically for the C++ memory model; they went through the trouble of refactoring their original hardware across several years so that all new cards would follow this model, even though PTX was designed as a polyglot target.

        Additionally, ISO C++ papers like senders/receivers are driven by NVidia employees working on CUDA.

        • suuuuuuuu 17 hours ago
          CUDA is not C++. CUDA for GPU kernels is its own language. That's the actual problem requiring new languages or abstractions.
          • pjmlp 17 hours ago
            Says those that don't know CUDA.

            You can program CUDA in standard C++20, with CUDA libraries hiding the language extensions.

            I love when C and C++ dialects are C and C++ when it matters, and not when it doesn't help to sell the ideas being portrayed.

            • suuuuuuuu 17 hours ago
              Sorry, I wasn't aware of these developments (having abandoned CUDA for hardware-agnostic solutions before 2020). It doesn't change my point anyway, if it's specific to a single vendor.

              I'm extremely dubious that such an opaque abstraction can actually solve the (true) problem. "Not having to write CUDA" is not enough - how do you tune performance? Parallelization strategies, memory prefetching and arrangement in on-chip caches, when to fuse kernels vs. not... I don't doubt the compiler can do these things, but I do doubt that it can know at compile time what variants of kernel transformations will optimize performance on any given hardware. That's the real problem: achieving an abstraction that still gives one enough control to achieve peak performance.

              Edit: you tell me if I'm wrong, but it seems that std::par can't even use shared memory, let alone let one control its usage? If so, then my point stands: C++ is not remotely relevant. Again, avoiding writing CUDA (etc.) doesn't solve the real problem that high-performance language abstractions aim to address.

              • ablob 9 hours ago
                So what would be such an HPC language that you're so fond of? A quick web search reveals only languages that use C++/CUDA code as a back end (Python), are new and experimental (Julia), or FORTRAN. For what you're talking about, none seem all too good, so you've piqued my curiosity.
            • bjourne 9 hours ago
              If CUDA is C++ then I'd like to know how you throw and catch exceptions in CUDA kernels.
              • pjmlp 2 hours ago
                The same way people writing C++ code as Google employees do, including in LLVM and Chrome.
      • fcanesin 20 hours ago
        Wait what!? I have been programming CUDA since 2009 and specifically remember it being pushed to C++ as main development language for the first few years, after a brief "CUDA C extension" period.
        • adrian_b 14 hours ago
          CUDA variants extend several programming languages, including C, C++ and Fortran.

          None of the extended languages is the same as the base language, in the same way that OpenMP C++ is not the same as C++, OpenMP Fortran is not the same as Fortran, and SYCL is not the same as C++.

          The extended languages include both extensions and restrictions of the base language. In the part of a program that will run on a GPU you can do things that cannot be done in the base language, but there are also parts of the base language, e.g. of C++, which are forbidden.

          All these extended languages have the advantage that you can write in a single source file a complete multithreaded program, with parts running concurrently on a CPU and parts running concurrently on a GPU, but for the best results you must know the rules that apply to the language accepted by each of them. It is possible to write programs that run without modification on either a CPU or a GPU, but this is paid for with lower performance on both, because such a program uses only generic language features that work on either, instead of taking advantage of specific features.

        • suuuuuuuu 17 hours ago
          CUDA is not C++. CUDA for GPU kernels is its own language. That's the actual problem requiring new languages or abstractions.
  • Xcelerate 20 hours ago
    I wonder how much of the programming language problem is due to churn of the user base. Looking over many comments in this thread, I see “Oh, back when I did HPC...” I used Titan for my own work back in 2012. But after my PhD, I never touched HPC again. So the people writing the code use what’s there but don’t stay long enough to help incentivize new or better languages. Now on the hardware side (e.g., design of interconnects), that more commonly seems to be a full career.

    The other issue is that to really get the value out of these machines, you sort of have to tailor your code to the machine itself to some degree. The DOE likes to fund projects that really show off the unique capabilities of supercomputers, and if your project could in principle be done on the cloud or a university cluster, it’s likely to be rejected at the proposal stage. So it’s sort of “all or nothing” in the sense that many codebases for HPC are one-off or even have machine-specific adaptations (e.g., see LAMMPS). No new general purpose language would really make this easier.

    • kinow 14 hours ago
      Churn of the user base could be playing a role in this, but I think it may not be too significant. In Europe there are multiple universities with HPC masters, which provide new users/devs to HPC. I worked with HPCs in New Zealand, and now I am doing the same in Spain. We hire multiple people from other HPC centers in Germany/UK/Italy, and equally lose people to those sites.

      I think the field is actually growing, with AI, digital twins, and more industry projects (CFD, oil, models for fisheries, disease simulations for healthcare, etc.).

  • jpecar 23 hours ago
    All these fancy HPC languages are nice and dandy, but the hard reality I see on our cluster is that most of the work is done in Python, R, and even Perl and awk. MPI barely reached us, and people still prefer huge single machines to proper distributed computing. Yeah, bioinformatics is from another planet.
    • jltsiren 22 hours ago
      Bioinformatics is an outlier within HPC. It's less about numerical computing and more about processing string data with weird algorithms and data structures that are rarely used anywhere else.

      Distributed computing never really took off in bioinformatics, because most tasks are conveniently small. For example, a human genome is small enough that you can run most tasks involving a single genome on an average cost-effective server in a reasonable time. And that was already true 10–15 years ago. And if you have a lot of data, it usually means that you have many independent tasks.

      Which is nice from the perspective of a tool developer. You don't have to deal with the bureaucracy of distributed computing, as it's the user's responsibility.

      C++ is popular for developing bioinformatics tools. Some core tools are written in C, but actual C developers are rare. And Rust has become popular with new projects — to the extent that I haven't really seen C++20 or newer in the field.

      • zozbot234 18 hours ago
        Bioinformatics is also seeing huge gains from rewriting the slow Python code into highly parallel Rust (way less fiddly than C++ for the typical academic dev).
        • calvinmorrison 17 hours ago
          This is not new either. Much of numpy and pandas and other stuff works this way: you use the Python C interface, pass arrays in, and get data back. You can write small embeddable C libraries pretty easily for the real crunching, and you get the ease of writing Python (basically comprehensible to researchers who understand The MATLAB).
          • uecker 37 minutes ago
            I would say the last thing the scientific community needs is the packaging mess of Python being introduced at the lower level too, via Rust.
    • jpecar 23 hours ago
      To add on to this, what I see gaining traction are "workflow managers", tools that let people specify the flow of data through various tools. These can figure out how to parallelize things on their own, so users are not burdened with this task.

      So from what I see actual programming language doesn't matter as much as how the work is organized. Anything helping people simplify this task is of immediate benefit to the science.

      • jkh1 22 hours ago
        Most of the time in bio-related fields, we need high-throughput computing not high-performance computing.
    • bluedino 17 hours ago
      Python is huge in AI/ML etc as well.

      I haven't talked to anyone writing C++ code on an HPC cluster that I'm working on in a long, long time. And that's in industrial/chemical/automotive fields.

  • riffraff 1 day ago
    Perhaps one issue lacking discussion in the article is how easy it is to find devs?

    I've never worked in HPC but it seems it should be relatively simple to find a C/C++ dev that can pick up OpenMP, or one that already knows it, compared to hiring people who know Chapel.

    The "scaling down" factor (how easy or interesting it is to use tool X for small use) seems a disadvantage of HPC-only languages, which creates a barrier to entry and a reduction in available workforce.

    • KaiserPro 22 hours ago
      I worked in HPC adjacent fields for a while (up until 40gig ethernet was cheap enough to roll out to all the edge nodes)

      There are a couple of big things that are difficult to get your head around:

      1) when and where to dispatch and split jobs (i.e. what's the setup cost of spinning up n binaries on n machines vs threading on y machines)

      2) data exchange primitives. Shared file systems have quirks, and they differ from system to system. But most of the time it's better/easier/faster to dump shit to a file system than to some fancy database/object store. Until it's not. Distributed queues are great, unless you're using them wrong. Most of the time you need to use them wrong. (Shared-memory RPC is a whole other beast that fortunately I've never had to work with directly.)

      3) dealing with odd failures. As the number of parallel jobs increases, the chance of getting a failure approaches 1 (with n jobs that each fail with probability p, the chance of at least one failure is 1 - (1-p)^n). You need to bake in failure modes at the very start.

      4) loading/saving data is often a bottleneck; a lot of efficiency comes from being clever about what you load, and _where_ you load it (i.e. you have data affinity, which might be location based or topology based, and you don't often have control over where your stuff is placed).

    • kinow 23 hours ago
      I think hpc devs need an extra set of skills that are not so common. Such as parallel file systems, batch schedulers, NUMA, infiniband, and probably some domain-specific knowledge for the apps they will develop. This knowledge is also probably a bit niche, like climate modelling, earthquake simulation, lidar data processing, and so it goes.

      And even knowing OpenMP or MPI may not suffice if the site uses older versions or heterogeneous approaches with CUDA, FPGAs, etc. Knowing the language and the shared/distributed memory libs helps, but if your project needs a new senior dev then they may be a bit hard to find (although the popularity of the company/HPC site, salary, and location also play a role).

      • physicsguy 22 hours ago
        You tend to only learn these things as they become a problem too. That's super super domain specific and it doesn't always translate between areas of research.

        So, for example, when I did HPC simulation codes in magnetics, there was little point focusing on some of these areas, because our codes were dominated by the long-range interaction cost, which limited compute scaling. All of our effort went into tuning those algorithms to the absolute max. We tried heterogeneous CPU + GPU but had very mixed results, and at that time (the 2010s) the GPU memory wasn't large enough for the problems we cared about either.

        I then moved to CFD in industry. The concerns there were totally different since everything is grid local. Partitioning over multi-GPU is simple since only the boundaries need to be exchanged on each iteration. The problems there were much more on the memory bandwidth and parallel file system performance side.

        Basically, you have to learn to solve whatever challenges get thrown up by the specific domain problem.

        > And even knowing OpenMP or MPI may not suffice if the site uses older versions

        To be fair, you always have the option of compiling yourself, but most people I met in academia didn't have the background to do this. Spack and EasyBuild make this much much easier.

      • bluedino 17 hours ago
        Most developers have no clue about any of that stuff. It's all abstracted out.
  • saltcured 9 hours ago
    I was a student intern in a parallel computing research group around that first reference point of 1995. My career went other ways, working more on distributed systems instead of programming language theory or implementation.

    But, when I encountered OpenCL and CUDA about ten years ago, I was struck by just how much these were delivering the SPMD parallel programming model in finished products. Around 1995, these were often C dialects with some wonky compiler that each research group just barely kept together. By 2015, they were just bundled up inside a graphics driver or similarly commoditized runtime environment.

    Also, the GPU of 2015 was delivering the throughput we dreamed of in supercomputers back then. A teraFLOP went from a strategic theme to something you could deploy to your desktop.

  • hpcdude 16 hours ago
    HPC dude here, and this is a mostly correct article, but here is what it misses:

    1) It mentions in passing the hardware abstraction not being as universal as it seemed. This is more and more true: once we started doing FPGAs, then ASICs, and as ARM and other platforms started making headway, it fractured things a bit.

    GPUs too: I'm still a bit upset about CUDA winning over OpenCL, but Vulkan compute gives me hope. I haven't messed with SYCL but it might be a future possibility too.

    2) The real crux is the practical, production interfaces that HPC end users get. Normally I'm not exposing an entire cluster (or sub-cluster) to a researcher. I give them predefined tools which handle the computation across nodes for them (Slurm is old but still huge in the HPC space for a reason! When I searched the article for "slurm" I got 0 hits!). When it comes to science, reproducibility is the name of the game, and having more idempotent and repeatable code structures is what helps us gain real insights that others can verify or take and run with into new problems/solutions. Ad-hoc HPC programming doesn't do that well; existing languages plus Slurm and other orchestration layers handle it.

    Sidenote: One of the biggest advances recently is in RDMA (remote direct memory access) improvements, because the RAM needs of these datasets are growing to crazy numbers, and often you have underutilized nodes that are happy to help. I've only done RoCE myself, though, and not much with Infiniband (sorry Yale, that's why I flubbed the interview), but honestly, I still really like RoCE for the cluster side and LACP for front-facing ingress/egress.

    The point is existing tooling can be massaged and often we don't need new languages. I did some work with Mellanox/Weka prior to them being bought by Nvidia on optimizing the kernel shims for NFSv4 for example. Old tech made fast again.

    • ablob 9 hours ago
      AFAIK, you always have to break open abstractions for more performance. If you ignore cache levels in your program you're going to have a bad time, and depending on the system, the layout (and with it how you should use it) is different. The same is true for how machines are interconnected: depending on the wiring, you have different throughput values when sharing data between nodes. The whole area screams "not universal" to me.
  • pklausler 19 hours ago
    Honestly, if a language can't succeed in HPC alongside (or against) Fortran with its glacial rate of buggy evolution and poor track record of portability, and C++ with its never-ending attempts at parallelism, then it's not what HPC needs.

    (What HPC does need, IMNSHO, is to disband or disregard WG5/J3, get people who know what they're doing to fix the features they've botched or neglected for thirty years, and then have new procurements include RFCs that demand the fixed portable Fortran from system integrators rather than the ISO "standard".)

  • swiftcoder 23 hours ago
    It's interesting that none of the actor-based languages ever made it into this space. Feels like something with the design philosophy of Erlang would be pretty suitable to exploit millions of cores and a variety of interconnects...
    • jacquesm 22 hours ago
      That will never happen. The overhead is massive.
      • Joel_Mckay 22 hours ago
        People did try to create an OTP in HDL at one point.

        And Erlang has already run many telecom infrastructures for decades. Surprising given how fragile the multi-host implementation has proven.

        Erlang/Elixir are neat languages, and right next to Julia for fun. =3

        • jacquesm 18 hours ago
          Yes, it's absolutely amazing. But I'm intimately familiar with how the Erlang VM works, and it would seem to me to be a very bad match for HPC on the number-crunching side. It would likely do quite well on the orchestration side, but that would require the people writing the rest of the code to change their way of working completely. And given how much of that is still F77, I highly doubt they would be willing to make that investment without the promise of some massive gain.
          • Joel_Mckay 17 hours ago
            For simple AMQP, it has performed rather well for our use-cases over the years.

            Haven't personally deployed this version yet. ymmv =3

            https://www.rabbitmq.com/docs/quorum-queues

            • jacquesm 16 hours ago
              Of course it does, that's an ideal usecase. But the topic is a different kettle of fish.

              And I'm saying that as a complete fan of Erlang, it is one of the few pieces of software 'done right' that I'm aware of. Unfortunately that comes at the price of being unsuitable for highly optimized number crunching unless you want to break out the C-compiler and write an extension.

              Python is similar, but there the extensions have been grafted on so well lots of people forget that they are not part of the language itself. In the Erlang world you'd have a lot of leaks and conversions to make something like that work and it would likely never be as transparent as python, which in many ways is both the new BASIC and the new FORTRAN.

        • RandomTeaParty 15 hours ago
          Erlang is about reliability; HPC is about performance (it's literally in the name).
          • Joel_Mckay 13 hours ago
            Most HPC job queue cluster partition batching I saw was stone-age primitive by comparison. =3
  • chatmasta 20 hours ago
    HPC is heavily skewed toward academia, and it doesn’t have a lot of overlap with compiler nerds. I think this explains it.
  • nnevatie 20 hours ago
    Usually a new language faces the ecosystem-mass issue: the previously used language, e.g. C++, already has the critical mass of available libraries and frameworks. Getting to the same level of ecosystem maturity with a new language takes a long time, as seen with Rust.
  • rramadass 2 hours ago
    Relevant:

    The Art of High Performance Computing (a comprehensive series of textbooks) - https://theartofhpc.com/

    Previous discussion - https://news.ycombinator.com/item?id=38815334

  • DamonHD 22 hours ago
    I used to edit an HPC trade rag in the early 90s, so this was an interesting read!
  • RhysU 21 hours ago
    > we have failed to broadly adopt any new compiled programming languages for HPC

    The article neglects that all of C, C++, and Fortran have evolved over the last 30 years.

    Also, you'll find significant advances in the HPC library ecosystem over the trailing years. Consider, for example, Trilinos (https://trilinos.github.io/index.html) or Dakota (https://dakota.sandia.gov/about-dakota/) both of which push a ton of domain-agnostic capabilities into a C++ library instead of bolting them into a bespoke language. Communities of users tend to coalesce around shared libraries not creating new languages.

    • bradcray 14 hours ago
      The evolution of C, C++, and Fortran is touched on in a sidebar, although admittedly very briefly:

      > Champions of Fortran, C++, MPI, or other entries on this list could argue that…

    • pjmlp 20 hours ago
      The authors are aware, as the Chapel compiler makes use of LLVM.
      • RhysU 18 hours ago
        The author's framing "we have failed" suggests otherwise.

        This section, https://chapel-lang.org/blog/posts/30years/#ok-then-why, does not mention libraries at all.

        • pjmlp 17 hours ago
          Not really, you should actually read that section a few times as well.

          > A fact of life in HPC is that the community has many large, long-lived codes written in languages like Fortran, C, and C++ that remain important. Such codes keep those languages at the forefront of peoples’ minds and sometimes lead to the belief that we can’t adopt new languages.

          > In large part because of the previous point, our programming notations tend to take a bottom-up approach. “What does this new hardware do, and how can we expose it to the programmer from C/C++?” The result is the mash-up of notations that we have today, like C++, MPI, OpenMP, and CUDA. While they allow us to program our systems, and are sufficient for doing so, they also leave a lot to be desired as compared to providing higher-level approaches that abstract away the specifics of the target hardware.

          Nothing there suggests the languages don't improve; anyone that follows ISO knows where many of the improvements to Fortran, C and C++ are coming from.

          For example, C++26 is probably going to get BLAS into the standard library, senders/receivers is being sponsored by CUDA money.

          Another thing you missed from the author's background is that Chapel is sponsored by HPE and Intel, and one of the main targets is HPE Cray EX/XC systems; they know pretty well what is happening.

          • SiempreViernes 16 hours ago
            The fact that the author is a developer of Chapel pretty neatly explains why "no new language was adopted" is counted as a failure; the article itself makes little effort to argue for that value judgment.
            • bradcray 14 hours ago
              Author here: I didn't go into more detail on this than https://chapel-lang.org/blog/posts/30years/#maybe-hpc-doesnt... because I felt like the article was long enough already and that I'd recently covered that topic in detail in this series https://chapel-lang.org/blog/series/10-myths-about-scalable-... summarized here https://chapel-lang.org/blog/posts/10myths-part8/#summary
              • SiempreViernes 11 hours ago
                In the "maybe we don't need it" you open up with this:

                > Another explanation might be that HPC doesn’t really need new languages; that Fortran, C, and C++ are somehow optimal choices for HPC. But this is hard to take very seriously given some of the languages’ demerits

                It's honestly hard to think of a less specific claim than "some of [their] demerits"; this is clearly preaching-to-the-choir territory. Later, hints of substance appear, but the text is merely reminding the reader of something they are expected to already know.

                Moving on, the summary for the "ten myths" series starts with:

                > I wrote a series of eight blog posts entitled “Myths About Scalable Parallel Programming Languages” [...] In it, I described discouraging attitudes that our team encountered when talking about developing Chapel, and then gave my personal rebuttals to them.

                So it appears to be a text about the trouble of trying to break through with a new "HPC" language, and the reader is again expected to already know the (potentially very good) technical reasons for why one would want to create a new one.

                • bradcray 9 hours ago
                  Good point on my alluding to demerits of Fortran, C, and C++ without stating them, and thanks for clarifying your criticism. Using the four factors that I focused on as attractive features in new languages:

                  Productivity: For me, while Fortran has some nice features for HPC (multidimensional arrays), lots about its design feels very old-fashioned to my (not particularly young) eyes. C and C++ are more "my generation" of programming language, so are familiar and comfortable, yet they still seem verbose, convoluted, and less readable (more symbolically oriented) as compared to Python, Julia, or Swift, which are more what I'm looking for in terms of productivity these days. Of the three, C++ has clearly made the biggest strides in recent years to improve productivity, with some successes in my opinion, though I've also had a hard time keeping up with all the changes.

                  Safety: I consider C and C++ to be fairly unsafe languages compared to more modern alternatives. I don't have enough experience with Fortran to have a particularly informed opinion, but feel as though I've been aware of patterns in the past that have felt unsafe. Here again, I think using modern C++ in a certain style (e.g., smart pointers) probably makes nice strides w.r.t. safety, but I'd still consider there to be a gap between it and Python/Rust (as does my colleague in this post: https://chapel-lang.org/blog/posts/memory-safety/)

                  Portability: Modulo the degree to which various compilers keep up with the latest standards in Fortran and C++, I'd consider all three languages to be quite portable.

                  Performance: There's no question that these are high-performing languages in the sequential computing setting. In HPC, while Fortran or C++ and MPI are often considered the gold standard, it's a standard that can be beat if your language maps more natively to the network's capabilities, or knows how to optimize for distributed memory computing rather than relying on the programmer to do it themselves.

                  With respect to the "10 myths" series, while the focus of the series was about combatting prevalent negative attitudes about new languages in the HPC community, I think there's a lot of content along the way that rationalizes the value of creating new languages in my rebuttals. That said, I fully realize that it's a long read, particularly in its updated "Redux" form.

                  Thanks again for clarifying your previous point.

            • pjmlp 15 hours ago
              > I didn’t put Chapel on my list of broadly adopted HPC programming notations above, in large part to avoid being presumptuous. But it’s also because, regrettably, I don’t consider Chapel’s support within the community to be as solid as the others on my list
  • ivell 19 hours ago
    I think Mojo has a good chance to become suitable for HPC.
    • rirze 12 hours ago
      I don't understand why people are so excited for Mojo. I don't get the impression it will replace anything but computational scripting that was done previously in Python.

      HPC is a different beast as far as I'm aware.

      • convolvatron 11 hours ago
        I don't think that's accurate. Mojo is explicitly compiling tensor graphs that run on accelerators. It's not like PyTorch, where Python provides the chassis but not the engine.

        I don't think it's going to be a good general HPC language, just because it's targeting a specific set of AI workloads, but they have shown some examples of synthesizing code which is comparable to hand-written kernels.

        But it's not out of the question from first principles.

  • guywithahat 14 hours ago
    I like the idea of Chapel, but I'm not sure I agree with a lot of their design choices. Some of the parallelization features seem like they just copied OpenMP without meaningfully improving on it. They also kept exceptions, which are generally on their way out, especially in compiled languages (Go, Rust, and Zig avoid them, and while they exist in modern C++, the language keeps introducing more ways to not use them). I think a new HPC language is possible, but I'm not sure this is the one.
  • shevy-java 19 hours ago
    > Could the reason be that language design is dead, as was asserted by an anonymous reviewer on one of our team’s papers ~30 years ago?

    It may not be dead, but it seems much harder for languages to gain adoption.

    I think there are several reasons; I also suspect AI contributes a bit to this.

    People usually specialize in one or two languages, so the more languages exist, the less variety we may see in terms of people ACTUALLY using each language. If I were, say, 15 years old, I might pick up Python and just stick with it rather than experiment and try out many languages. Or perhaps not even write software at all, if AI auto-writes most of it anyway.

  • crabbone 22 hours ago
    As someone who worked for a while and still works in HPC, my impression of this field, compared to e.g. programming in the finance sector or the storage sector, is that... HPC is so backwards and far behind, it's really amazing how it's portrayed as some sort of champion of the field.

    That's not to say that new things don't happen there, it's just that I find a lot of old stuff that was shown to be bad decades ago still being in vogue in HPC. Probably because it's a relatively small field with a lot of people there being academics and not a lot of migration to/from other fields.

    You've probably never heard of `module` (either Tcl or Lmod). This is a staple of the HPC world. What this thing does is source some shell variables and functions into the shell used either interactively or by a batch job, or (try to) remove them. This is a beyond-atrocious way to handle your working environment. The information leaks and becomes stale; you often end up loading the wrong thing into your environment. It's simply amazing how bad this thing is. And yet, it's just everywhere in HPC.

    Another example: running anything in HPC basically means running Slurm batch jobs. There are alternatives, but those are even worse (e.g. OpenPBS). When you dig into the configuration of these tools, you realize they've been written for pre-systemd Linux and are held together by a shoestring of shell scripting. They seldom, if ever, do the right thing when it comes to logging or general integration with the environment they run in. They can be simultaneously on the bleeding edge (e.g. cgroup integration or accelerator driver integration) and completely backwards when it comes to having a sensible service definition for systemd (e.g. they try to manage their service dependencies on their own instead of relying on systemd to do it for them).

    In other words, imagine a steam-punk world, but in software. That's sort of how HPC feels after a decade or so in more popular programming fields.

    Also, a lot of code written for HPC is written the way it is not because the writer chose the language or the environment. The typical setup is: university IT created a cluster with whatever tools they managed to put there eons ago, and you, the code writer, have to deal with... using CentOS 6 by authenticating to the university's AD... in your browser... through a JupyterLab interface. And there's nothing you can do about it, because the IT isn't there, is incompetent to the bone, and as long as you can get your work done somehow, you'd prefer that over fighting to perfect your toolchain.

    Bottom line, unless a language somehow becomes indispensable in this world, no matter its advantages, it's not going to be used because of the huge inertia and general unwillingness to do beyond the minimum.

    • pama 20 hours ago
      HPC centers never loved the inefficiencies of anything virtualized (VMs or any containers, really), so the shell hacks of module enabled a (limited, but workable) level of reproducibility that was sufficiently composable and usable by researchers who understood the shell. I am not going to defend this Tcl hack any further, but I can see how it was the path of least resistance when people tried to stay close to the raw metal of their large clusters while keeping some level of sanity. Slurm is a more defensible choice, but I agree that these tools are from a different era of compute. I grew to love and hate these tools, but they definitely represent an acquired taste, like durian fruit, not like an apple.

      Your centos6 references made me chuckle :-)

      • zozbot234 18 hours ago
        Containers are an OS sandboxing/namespacing primitive, they don't involve any overhead on their own. The overhead is dependent on what's inside the container besides a single deployed binary.
      • pphysch 16 hours ago
        I promise you that the main reason HPC is behind on virtualization is not because of the little bit of overhead. There are a dozen other inefficiencies in the average HPC workload that are more significant.

        Most centers don't even have good real-time observability systems to diagnose systemic inefficiencies, leaving application/workload profiling purely up to user-space.

        The HP in HPC has really been watered down over the last couple decades, and "IT for computational research" would be a more accurate name. You can do genuinely high-performance computing there, but you'll be an outlier.

        • saltcured 9 hours ago
          It's a mixture of legacy and reality.

          For one, the assumption has been that you had dedicated use of all the nodes and communication network. It would kill your performance if your local node CPU scheduler was interfering with having your actual HPC program active when the messages were coming in from its peer tasks on the other nodes, since parallel jobs are limited in the end by the critical path latency of the cross-node communications.

          It's only on the most "embarrassingly parallel" end of the spectrum where you can tolerate a bunch of virtualization and non-determinism, because the tasks communicate so infrequently or via such asynchronous mechanisms that they don't really impact the throughput of the whole job if they are asleep at random times.

          But HPC systems also were very "unique". It wasn't just all Linux but a dozen different vendors' Unix variants with very different personalities. And for the bleeding-edge systems, each deployment was practically its own dialect of that vendor OS. Running a job was like cross-compiling to a one of a kind target. There was no generic platform where you could expect to build an app once and ship it around to whichever supercomputer was available.

          • pphysch 8 hours ago
            Agreed on all points and this captures the history well.
    • anewhnaccount2 21 hours ago
      How should it be better? Most environments offer Apptainer, which can import Docker containers. Plus, a lot of these languages, like Julia and Chapel, are pretty self-contained and programmed against e.g. an ancient libc for these very reasons.
    • rirze 11 hours ago
      It's been years since I last used `slurm`. Thanks for the blast from the past.
    • sliken 15 hours ago
      As you dig deeper I think you'll find a method behind the madness.

      Sure, modules just play with env variables. But they're easy to inspect (module show), easy to document ("use module load ..."), allow admins to change the default when things improve or bugs get fixed, but also allow users to pin the version. It's very transparent, very discoverable, and very "stale". Research needs dictate that you can reproduce research from years past. It's much easier to look at your output file and see the exact version of compiler, MPI stack, libraries, and application than to dig into a container build file or similar. Not to mention it's crazy more efficient to look at a few lines of output than to keep the container around.

      As for Slurm, I find it quite useful. Your main complaint is no default systemd service files? It's not like it's hard to set up systemd and dependencies. Slurm's job is scheduling, which involves matching job requests for resources, deciding whom to run, and where to run it. It does that well and runs jobs efficiently: cgroup v2, pinning tasks to the CPUs they need, placing jobs on the CPU closest to the GPU they're using, etc. When combined with PMIx 2 it allows impressive launch speeds across large clusters. I guess if your biggest complaint is the systemd service files, that's actually high praise. You did mention logging: I find it pretty good; you can increase the verbosity, focus on the server (slurmctld) or client side (slurmd), and enable just what you are interested in, like say +backfill. I've gotten pretty deep into the weeds, and basically everything Slurm does can be logged, if you ask for it.

      Sounds like you've used some poorly run clusters; I don't doubt it, but I wouldn't assume that's HPC in general. I've built HPC clusters and did not use the university's AD, specifically because it wasn't reliable enough. IMO a cluster should continue to schedule and run jobs even if the uplink is down. Running a past-EoL OS on an HPC cluster is definitely a sign that it's not run well; it seems common when a heroic student ends up managing a cluster and then graduates, leaving the cluster unmanaged. Sadly it's pretty common for IT to run an HPC cluster poorly; it's really a different set of constraints, thus the need for an HPC group.

      Plenty of HPC clusters out there are happy to support the tools that help their users get the most research done.

  • chinabot 21 hours ago
    There has been a very big adoption of ENGLISH as a programming language in the last year or so, and, painful as it sounds, AI is already generating machine code without compilers, so let's see where we are in 2030.