How NASA built Artemis II’s fault-tolerant computer

(cacm.acm.org)

347 points | by speckx 17 hours ago

21 comments

  • dmk 8 hours ago
    The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.
    • carefree-bob 4 hours ago
      During the time of the first Apollo missions, a dominant portion of computing research was funded by the defense department and related arms of government, making this type of deterministic and WCET (worst case execution time) a dominant computing paradigm. Now that we have a huge free market for things like online shopping and social media, this is a bit of a neglected field and suffers from poor investment and mindshare, but I think it's still a fascinating field with some really interesting algorithms -- check out the work of Frank Mueller or Johann Blieberger.
      • therobots927 3 hours ago
        Contrary to propaganda from the likes of Ludwig von Mises, the free market is not some kind of optimal solution to all of our problems. And it certainly does not produce excellent software.
        • psd1 2 hours ago
          I can't think of a time when I've found an absolutist position useful or intelligent, in any field. Free-market absolutism is as stupid as totalitarianism. The content of economics papers does not need to be evaluated to discard an extreme position, one need merely say "there are more things in earth and heaven than are dreamed of in your philosophies"
        • nairboon 16 minutes ago
          Propaganda is quite a strong term to describe the works of an economist. If one wants to debate the ideas of von Mises, it'd be useful to consider the Zeitgeist at that time. Von Mises preferred free markets in contrast to the planned economy of the communists. Partly because the latter has difficulties in proper resource allocation and pricing. Note that this was decades before we had working digital computers and digital communication systems, which, at least in theory, change the feasibility of a planned economy.

          Also, the last time I checked, the US government produced its goods and services using the free market. The government contractors (private enterprises) are usually tasked with building stuff, compared with the government itself in a non-free, purely planned economy (if you refer to von Mises).

          I assume that you originally meant to refer to the idea that without government intervention (funding for deep R&D), the free market itself would probably not have produced things like the internet or the moon landing (or at least not within the observed time span). That is, however, a rather interesting idea.

    • ggm 1 hour ago
      Time-triggered Ethernet is part of certified aircraft data buses and has a deep, decades-long history. I believe INRIA did work on this, feeding Airbus maybe. It makes perfect sense when you can design for it. An aircraft is a bounded problem space of inputs and outputs which can have deterministic required minima; you can build for that, and hopefully even have headroom for extras.

      Ethernet is such a misnomer for something which now is innately about a switching core ASIC or special purpose hardware, and direct (optical even) connects to a device.

      I'm sure there are also buses, dual redundant, master/slave failover, you name it. And given it's air or space probably a clockwork backup with a squirrel.

      • Arch-TK 56 minutes ago
        A real squirrel would need acorns, I would assume it's a clockwork squirrel too.
    • iknowstuff 5 hours ago
      Tesla’s Cybertruck uses that in its ethernet as well!
      • carefree-bob 3 hours ago
        All the ADAS automotive systems use this, there are several startups in this space as well, such as Ethernovia.
    • dyauspitr 5 hours ago
      Agile is not meant to make solid, robust products. It’s so you can make product fragments/iterations quickly, with okay quality and out to the customer asap to maximize profits.
      • nickff 4 hours ago
        “Agile” doesn’t mean that you release the first iteration, it’s just a methodology that emphasizes short iteration loops. You can definitely develop reliable real-time systems with Agile.
        • tomasGiden 2 hours ago
          I would differentiate between iterative development and incremental development.

          Incremental development is like painting a picture line by line, like a printer, where you add new pieces to the final result without affecting old pieces.

          Iterative is where you do the big brush strokes first and then add more and more detail, depending on what you learn from each previous brush stroke. You can also stop at any time, when you think the result is good enough.

          If you are making a new type of system and don't know what issues will come up or what customers will value (a highly complex environment), iterative is the thing to do.

          But if you have a very predictable environment and you are implementing a standard or a very well specified system (it can be highly complicated yet not very complex), you might as well do incremental development.

          Roughly speaking, though: there is of course no perfect specification short of the final implementation, so there are always learnings, and thus always some iterative parts to it.

        • kermatt 4 hours ago
          > “Agile” doesn’t mean that you release the first iteration

          Someone needs to inform the management of the last three companies I worked for about this.

          • t43562 2 hours ago
            Management understand it less than anyone else does.
        • g6pdh 33 minutes ago
          A physicist who worked on radiation-tolerant electronics here. Apart from the short iteration loops, agile also means that the SW/HW requirements are not fully defined during the first iterations, because they may evolve over time. But this cannot be applied to projects where radiation/fault tolerance is the top priority. Most of the time, the requirements are 100% defined ahead of time, leading to a waterfall-like process, or a mixed one where development is still agile but the requirements are never discussed again, except in negligible terms.
      • buster 3 hours ago
        You hopefully know that's not true. But it's a matter of quality goals. Need absolute robustness? Prioritize it and build it. Need speed and to be first to market? Prioritize it and build it. You can do both in an agile way. Many would argue that you won't be as fast in a non-agile way. There is no bullet point in the agile manifesto saying to build unreliable software.
        • dyauspitr 1 hour ago
          Yeah, I know it’s not true in the sense that that’s not what it’s meant to do, but I’m saying practically that’s what usually ends up happening.
      • froddd 3 hours ago
        The manifesto refers to “working software”. It does not say anything about “okay quality”.
      • sylware 54 minutes ago
        ... and it mechanically promotes planned obsolescence by its nature (likely to be of disastrous quality). The perfect mur... errr... the perfect fraud.
    • arduanika 7 hours ago
      You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully.

      That alone is worth my tax dollars.

    • tayk47999 8 hours ago
      > “Modern Agile and DevOps approaches prioritize iteration, which can challenge architectural discipline,” Riley explained. “As a result, technical debt accumulates, and maintainability and system resiliency suffer.”

      Not sure I agree with the premise that "doing agile" implies decision-making at odds with architecture: you can still iterate on architecture. Terraform etc. make that very easy. Sure, tech debt accumulates naturally as a byproduct, but every team I've been on regularly does dedicated tech-debt sprints.

      I don't think the average CRUD API or app needs "perfect determinism", as long as modifications are idempotent.

    • pjmlp 3 hours ago
      As a '70s child who was there when the whole agile wave took over and systems engineers got rebranded as devops, I fully agree with them.

      Add TDD, XP and mob programming as well.

      While in some ways better than pure waterfall, most companies never adopted them fully, and in some scenarios they are better suited to a Silicon Valley TV show than anything else.

    • mvkel 6 hours ago
      If you look at code as art, where its value is a measure of the effort it takes to make, sure.
      • stodor89 4 hours ago
        Or if you're building something important, like a spaceship.
      • BobbyTables2 5 hours ago
        In that case, our test infrastructure belongs in the Louvre…
      • couchand 6 hours ago
        If your implication is that stencil art does not take effort then perhaps you may not fully appreciate Banksy. Works like Gaza Kitty or Flower Thrower don’t just appear haphazardly without effort.
    • vasco 4 hours ago
      It's not like the approach they took is any different. They just slapped 8x the number of computers on it to calculate the same thing and wait to see if any disagree. Not the pinnacle of engineering. The equivalent of throwing money at the problem.
      • curiousObject 3 hours ago
        >Just slapped 8x the number of computers on it

        ‘Just’ is not an appropriate word in this context. Much of the article is about the difficulty of synchronization, recovery from faults, and about the redundant backup and recovery systems

      • MikeTheGreat 3 hours ago
        What happens when they don't?
        • vasco 3 hours ago
          If you have a point to make, make it.
          • MikeTheGreat 3 hours ago
            What my question is hinting at is that there's actually some really interesting engineering around resolving what happens when the systems disagree. Things like Paxos and Raft help make this much more tractable for mere mortals (like myself); the logic and reasoning behind them are cool and interesting.
            • FabHK 2 hours ago
              Though here the consensus algorithm seems totally different from Paxos/Raft. Rather it's a binary tree, where every non-leaf node compares the (non-silent) inputs from the leaf, and if they're different, it falls silent, else propagates the (identical) results up. Or something something.
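              A toy sketch of that compare-and-propagate idea (my reading of the comment, not Orion's actual logic), with silence modeled as None:

```python
# Toy fail-silent comparison tree: non-leaf nodes compare their children's
# non-silent outputs, fall silent on disagreement, else propagate upward.
# Hypothetical illustration; assumes a power-of-two number of leaves.

def compare_node(left, right):
    """One non-leaf node: silent on disagreement, else propagate."""
    if left is None or right is None:
        # One child is silent: propagate the surviving value (or silence).
        return left if right is None else right
    return left if left == right else None  # disagree -> fall silent

def tree_output(leaves):
    """Fold a list of leaf outputs pairwise up a binary tree."""
    level = list(leaves)
    while len(level) > 1:
        level = [compare_node(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

print(tree_output([42, 42, 42, 42]))  # 42
print(tree_output([42, 13, 42, 42]))  # 42 (the faulty pair fell silent)
print(tree_output([13, 13, 42, 42]))  # None (pairs disagree at the root)
```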
            • vasco 2 hours ago
              Wasn't that way better, there's no need to drop bait. Thanks.
    • ramraj07 8 hours ago
      I take the opposite message from that line - out of touch teams working on something so over budget and so overdue, and so bureaucratic, and with such an insanely poor history of success, and they talk as if they have cured cancer.

      This is the equivalent of Altavista touting how amazing their custom server racks are when Google just starts up on a rack of naked motherboards and eats their lunch and then the world.

      Let's at least wait till the capsule comes back safely before touting how much better they are than "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos.

      • danhon 8 hours ago
        You mean like this?

        "With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost."

        https://blog.codinghorror.com/building-a-computer-the-google...

        • kukkeliskuu 2 hours ago
          The biggest innovation from Google regarding hardware was understanding that dropping memory prices had made it feasible to serve most data directly from memory. Even though memory was more expensive, you could serve requests faster, meaning less server capacity, meaning reduced cost.
        • ramraj07 7 hours ago
          The problem they solved isn't easy. But it's not some insane technical breakthrough either. Literally add redundancy, that's the ask. They didn't invent quantum computing to solve the issue, did they? Why dunk on sprints?
          • vlovich123 6 hours ago
            Wow. What a hand wave away of the intrinsic challenge of writing fault tolerant distributed systems. It only seems easy because of decades of research and tools built since Google did it, but by no means was it something you could trivially add to a project as you can today.
            • tempest_ 5 hours ago
              > fault tolerant distributed systems

              I mean there were mainframes which could be described as that. IBM just fixed it in hardware instead of software so its not like it was an unknown field.

              • vlovich123 3 hours ago
                Even if that were actually true (it’s not in important ways) Google showed you could do this cheaply in software instead of expensive in hardware.

                You’re still hand waving away things like inventing a way to make map/reduce fault tolerant and automatic partitioning of data and automatic scheduling which didn’t exist before and made map/reduce accessible - mainframes weren’t doing this.

                They pioneered how you durably store data on a bunch of commodity hardware through GFS - others were not doing this. And they showed how to do distributed systems at a scale not seen before because the field had bottlenecked on however big you could make a mainframe.

        • 1970-01-01 7 hours ago
          Google then had complete regret not doing this with ECC RAM: https://news.ycombinator.com/item?id=14206811
          • newmana 5 hours ago
            A great version of this and how ex-DEC engineers saved Google and their choice of ECC RAM - inventing MapReduce and BigTable https://www.youtube.com/watch?v=IK0I4f8Rbis
          • ramraj07 7 hours ago
            It got them to where they needed to be to then worry about ECC. This is like the dudes who deploy their blog on Kubernetes just in case it hits the front page of the New York Times or something.
            • JumpCrisscross 2 hours ago
              > then had complete regret not doing this with ECC RAM

              Yeah, my takeaway is Google made the right choice going with non-ECC RAM so they could scale quickly and validate product-market fit. (This also works from a perspective of social organisation. You want your ECC RAM going where it's most needed. Not every college dropout's Hail Mary.)

      • bluegatty 7 hours ago
        No, space is just hard.

        Everything is bespoke.

        You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines.

        People died on the Apollo missions.

        It just costs that much.

        • arduanika 7 hours ago
          Please, this is hacker news. Nothing else is hard outside of our generic software jobs, and we could totally solve any other industry in an afternoon.
          • geerlingguy 7 hours ago
            I mean I can just replace Dropbox with a shell script.
            • InsideOutSanta 47 minutes ago
              "No wireless. Less space than a Nomad. Lame."

              No, wait, that was that other site.

            • bluegatty 7 hours ago
              That's funny, because you could! Dropbox started as a shell script :)

              Funny though, I would have assumed HN people would respect how hard real-time and 'hardened' stuff is.

        • ramraj07 6 hours ago
          Yep, spend 100 billion on what should have cost 1/50th of that, send people up to the moon on rockets where we're still keeping our fingers crossed they won't kill them tomorrow, and we have to congratulate them for dunking on some irrelevant career?
      • bfung 7 hours ago
        One simply does not [“provision” more hardware|(reboot systems)|(redeploy software)] in space.
      • therobots927 3 hours ago
        Modern software development is a fucking joke. I’m sorry if that offends you. Somehow despite Moore’s law, the industry has figured out how to actually regress on quality.
        • childintime 37 minutes ago
          Lately it strikes me there's a big gap between the value promised and the value actually delivered, compared to a simple home-grown solution (with a generic tool like a text editor or a spreadsheet, for example). If they just showed us how to fish, we wouldn't be buying; the magic would be gone.

          In this sense all of the West is full of shit, and it's a requirement. The intent is not to help and make life better for everyone, cooperate, it is to deceive and impoverish those that need our help. Because we pity ourselves, and feed the coward within, that one that never took his first option and chose to do what was asked of him instead.

          This is what our society steers us away from, in its wish to be the GOAT, and to control. It results in the production of lives full of fake achievements, the constant highs which I see Muslims actively opt out of. So they must be doing something right.

        • misiek08 2 hours ago
          And overall performance in terms of visible UX.
      • HNisCIS 7 hours ago
        What would you suggest? Vibe coding a react app that runs on a Mac mini to control trajectory? What happens when that Mac mini gets hit with an SEU or even a SEGR? Guess everyone just dies?
        • mlsu 5 hours ago
          No, of course not! It would be far better to have an openClaw instance running on a Mac Mini. We would only need to vibe code a 15s cron job for assistant prompting...

          USER: You are a HELPFUL ASSISTANT. You are a brilliant robot. You are a lunar orbiter flight computer. Your job is to calculate burn times and attitudes for a critical mission to orbit the moon. You never make a mistake. You are an EXPERT at calculating orbital trajectories and have a Jack Parsons level knowledge of rocket fuel and engines. You are a staff level engineer at SpaceX. You are incredible and brilliant and have a Stanley Kubrick level attention to detail. You will be fired if you make a mistake. Many people will DIE if you make any mistakes.

          USER: Your job is to calculate the throttle for each of the 24 orientation thrusters of the spacecraft. The thrusters burn a hypergolic monopropellent and can provide up to 0.44kN of thrust with a 2.2 kN/s slew rate and an 8ms minimum burn time. Format your answer as JSON, like so:

               ```json
              {
                "x1": 0.18423,
                "x2": 0.43251,
                "x3": 0.00131,
                 ...
              }
               ```
          
          one value for each of the 24 independent monopropellant attitude thrusters on the spacecraft, x1, x2, x3, x4, y1, y2, y3, y4, z1, z2, z3, z4, u1, u2, u3, u4, v1, v2, v3, v4, w1, w2, w3, w4. You may reference the collection of markdown files stored in `/home/user/geoff/stuff/SPACECRAFT_GEOMETRY` to inform your analysis.

          USER: Please provide the next 15 seconds of spacecraft thruster data to the USER. A puppy will be killed if you make a mistake so make sure the attitude is really good. ONLY respond in JSON.

        • ramraj07 6 hours ago
          All Im suggesting is to be humble about your mediocre solutions. This is not the only solution and not that ingenious necessarily. Why do you need to bring up vibecoding here? Because people who criticize arrogant nasal engineers are also AI idiots by default?
          • InsideOutSanta 49 minutes ago
            Can't tell if "arrogant nasal engineers" is a typo or a hilarious attempt at an insult.
          • ToucanLoucan 6 hours ago
            Wild shit to be advising other people to be humble whilst talking directly out of your ass about technology you clearly do not understand and engineers you have no respect for.

            Perhaps self-reflect.

      • simoncion 8 hours ago
        > ...they talk as if they have cured cancer.

        I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level.

        • misiek08 2 hours ago
          So the quote is somewhat right, then? If you're writing for nontechnical people and using such lofty wording.
  • GautamB13 0 minutes ago
    It's kinda crazy how this mission didn't hit mainstream media until recently.
  • georgehm 4 hours ago
    > Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.

    > “A faulty computer will fail silent, rather than transmit the ‘wrong answer,’” Uitenbroek explained. This approach simplifies the complex task of the triplex “voting” mechanism that compares results. Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven’t failed silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.

    One part that seems omitted in the explanation: if both CPUs in a pair for whatever reason perform an erroneous calculation and their results match, how will that source be silenced without comparing its results against other sources?
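    The priority-ordered selection described in the quote can be sketched roughly like this (illustrative only; the names and structure are my own, not NASA's):

```python
# Sketch of priority-ordered source selection among fail-silent channels.
# Hypothetical model based on the article's description, not flight code.

SILENT = None  # a failed-silent channel produces no output at all

def select_output(fcm_outputs):
    """Return (channel, output) for the first non-silent FCM in priority
    order, or None if every channel has failed silent."""
    for channel, output in enumerate(fcm_outputs):
        if output is not SILENT:
            return channel, output
    return None

# FCM 0 has failed silent; FCM 1 is next in priority and gets picked.
print(select_output([SILENT, "burn:12.5s", "burn:12.5s", "burn:12.5s"]))
# -> (1, 'burn:12.5s')
```

    Note that the selection stage never cross-compares channels; it trusts each self-checking pair to silence itself, which is exactly the gap the question above is probing.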

    • guai888 4 hours ago
      These CPUs are typically implemented as lockstep pairs on the same die. In a lockstep architecture, both CPUs execute the same operations simultaneously and their outputs are continuously compared. As a result, the failure rate associated with an undetected erroneous calculation is significantly lower than the FIT rate of an individual CPU.

      Put another way, the FIT (Failure in Time) value for the condition in which both CPUs in a lockstep pair perform the same erroneous calculation and still produce matching results is extremely small. That is why we selected and accepted this lockstep CPU design.
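      A self-checking (lockstep) pair can be modeled as two replicas plus a comparator that suppresses output on any mismatch. A toy sketch of that idea (mine, not an actual lockstep implementation):

```python
# Toy model of a self-checking lockstep pair: both cores compute the same
# operation, and a comparator makes the unit fail silent on any mismatch.
# Hypothetical illustration, not real flight hardware behavior.

def lockstep(core_a, core_b):
    """Wrap two redundant implementations into a fail-silent unit."""
    def checked(*args):
        a, b = core_a(*args), core_b(*args)
        return a if a == b else None  # mismatch -> fall silent
    return checked

healthy = lockstep(lambda x: x * 2, lambda x: x * 2)
upset   = lockstep(lambda x: x * 2, lambda x: x * 2 + 1)  # one flipped core

print(healthy(21))  # 42
print(upset(21))    # None: the pair silences itself rather than lie
```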

    • CubicalOrange 2 hours ago
      The probability of a simultaneous cosmic-ray bit flip in 2 CPUs, in the same bit, is ridiculously low; there's probably a higher probability of them getting hit by a stray asteroid propelled by a solar flare.

      but still, murphy's law applies really well in space, so who knows.

    • alfons_foobar 2 hours ago
      I wondered about this as well.

      OTOH, consider that the "pick the majority from 3 CPUs" approach that seems to have been used in earlier missions (as mentioned in the article) would fail the same way if two CPUs computed the same erroneous result.

    • FabHK 2 hours ago
      Indeed. It seems like systems 1 and 2 could fail identically while 3 through 8 are all correct, and, as described, the wrong answer from 1 and 2 would be chosen (with a "25% majority"??).
    • themafia 4 hours ago
      In the Shuttle they used command averaging. All four computers had access to an actuator, which tied into a manifold that delivered power to the flight control surface. If one disagreed, you'd get 25% less command authority on that element.
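      A rough model of that averaging (my illustration, not Shuttle flight code): each channel contributes a quarter of the authority, so one failed channel costs 25%:

```python
# Toy model of Shuttle-style command averaging at an actuator manifold.
# Hypothetical sketch; real force-summing happens in hydraulics, not code.

def manifold_output(commands):
    """Average the channel commands; each channel has equal authority."""
    return sum(commands) / len(commands)

nominal = [1.0, 1.0, 1.0, 1.0]   # all four computers agree: full authority
one_bad = [1.0, 1.0, 1.0, 0.0]   # one failed channel commands zero

print(manifold_output(nominal))  # 1.0
print(manifold_output(one_bad))  # 0.75 -> 25% less command authority
```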
      • JumpCrisscross 1 hour ago
        > In the Shuttle they would use command averaging

        I think the Shuttle, operating only in LEO, had more margin for error. Averaging a deep-space burn calculation is basically the same as killing the crew.

        • themafia 1 hour ago
          The GNC loop runs several times per second. The desired output will consequently be increased by the working computers to achieve the target. The computer does not "dead reckon" anything.

          Travelling through Max-Q in Earth atmosphere on ascent is far more dangerous.

          • JumpCrisscross 1 hour ago
            > Travelling through Max-Q in Earth atmosphere on ascent is far more dangerous

            Fair enough. I don't know enough about Orion's architecture to guess at propellant reserves, and how life-or-death each burn actually is.

  • __d 6 hours ago
    Does anyone have pointers to some real information about this system? CPUs, RAM, storage, the networking, what OS, what language used for the software, etc etc?

    I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.

    • anthonj 2 hours ago
      NASA cFS is written in plain C (trying to follow MISRA C, etc.). It's open on GitHub and used by many companies. It typically runs over FreeRTOS or RTEMS; not sure which here.

      Personally I find the project extremely messy, and kinda hate working with it.

  • y1n0 7 hours ago
    NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.
    • voodoo_child 7 hours ago
      Nice “well, actually”. I’m sure Lockheed were building this quad-redundant, radiation-hardened PowerPC that costs millions of dollars and communicates via Time-Triggered Ethernet anyway, whether NASA needed one or not.
      • kube-system 5 hours ago
        Probably, if it already wasn’t developed for DoD.

        For example, the OS it seems to be running is integrity 178.

        https://www.ghs.com/products/safety_critical/integrity_178_s...

        Aerospace tech is not entirely bespoke anymore, plenty of the foundational tech is off the shelf.

        Historically, the main difference between ICBM tech and human spaceflight tech is the payload and reentry system.

      • y1n0 5 hours ago
        This is the equivalent of prompt engineering.
    • jakeinspace 4 hours ago
      True, but BFS was mainly done in-house. Source: my best friend and I worked on some parts of it.
    • adrian_b 6 hours ago
      Lockheed Martin and their subcontractors did the implementation.

      We do not know how much of the high-level architecture of the system has been specified by NASA and how much by Lockheed Martin.

      • y1n0 5 hours ago
        I do.
        • professorseth 5 hours ago
          Are you interested in sharing more details to make your claim more believable?
    • colechristensen 4 hours ago
      Eh, in these kinds of subcontractor relationships there is a lot of work and communication on both sides of the table.
    • Sebguer 6 hours ago
      will nobody think of the megacorps!!!
    • therobots927 3 hours ago
      Lockheed Martin also builds F-35s that Israel uses to slaughter children. If you’re going to give them credit for everything, don’t forget to give them credit for that.
  • geomark 5 hours ago
    I sure wish they would talk about the hardware. I spent a few years developing a radiation hardened fault tolerant computer back in the day. Adding redundancy at multiple levels was the usual solution. But there is another clever check on transient errors during process execution that we implemented that didn't involve any redundancy. Doesn't seem like they did anything like that. But can't tell since they don't mention the processor(s) they used.
    • themafia 4 hours ago
      One of the things I loved about the Shuttle is that all five computers were mounted not only in different locations but in different orientations, providing some additional hardening against radiation by presenting different cross sections to any incident event.
  • JumpCrisscross 1 hour ago
    Does anyone know how this compares to Crew Dragon or HLS?
  • jbritton 7 hours ago
    I wonder how often the problems this redundancy solves actually happen. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?
    • EdNutting 7 hours ago
      Not a direct answer but probably as good information as you can get: https://static.googleusercontent.com/media/research.google.c...

      Basically, yes, radiation does cause bit flips, more often than you might expect (still rare in the grand scheme of things, but often enough to matter).

      And radiation in space is much “worse” (in quotes because that word is glossing over a huge number of different problems, not just intensity).
    • Tomte 56 minutes ago
      IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failure in Time, i.e. 1E-9 failures/hour).

      That was in the 2000s though, and for embedded memory above 65nm.

      And obviously on earth.
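      For intuition, FIT converts into everyday rates like this (simple arithmetic on the 700 to 1200 FIT range quoted above):

```python
# FIT = failures per 1e9 device-hours. Converting the quoted soft-error
# range into per-device-year rates (plain arithmetic, nothing more).

HOURS_PER_YEAR = 24 * 365  # 8760

def failures_per_year(fit):
    """Expected soft errors per device-year for a given FIT rate."""
    return fit * 1e-9 * HOURS_PER_YEAR

for fit in (700, 1200):
    rate = failures_per_year(fit)
    print(f"{fit} FIT -> {rate:.4f} soft errors per device-year "
          f"(one every ~{1 / rate:.0f} years)")
```

      Per device that's roughly one soft error a century, but across a fleet of thousands of devices, or a mission that can't tolerate even one undetected flip, it adds up fast.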

    • tosapple 6 hours ago
      [dead]
  • vhiremath4 3 hours ago
    > “Along with physically redundant wires, we have logically redundant network planes. We have redundant flight computers. All this is in place to cover for a hardware failure.”

    It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!

  • gambiting 47 minutes ago
    So honest and perhaps a bit stupid question.

    Astronauts have actual phones with them - iPhone 17s, I think? And a regular ThinkPad they use to upload photos from the cameras. How does all of that equipment work fine with all the cosmic radiation floating about? With the iPhone's CPU in particular, shouldn't random bit flips cause constant crashes due to errors? Or do these errors happen but nothing detects them, so execution continues unhindered?

    • EdNutting 2 minutes ago
      They’re not mission-critical equipment. If they fail, nobody dies.

      They’re not radiation hardened, so given enough time, they’d be expected to fail. Rebooting them might clear the issue or it might not (soft vs hard faults).

      Also impossible to predict when a failure would happen, but NASA, ESA and others have data somewhere that makes them believe the risk is high enough that mission critical systems need this level of redundancy.

  • starkparker 17 hours ago
    Headline needs its how-dectomy reverted to make sense
    • arduanika 7 hours ago
      (Off-topic:) Great word. Is that the usual word for it? Totally apt, and it should be the standard.
  • object-a 8 hours ago
    How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything
    • pjerem 2 hours ago
      Orbital data centers are still nothing more than the current hyperloop.
    • willdr 1 hour ago
      Orbital data centres are a stupid concept.
    • aidenn0 5 hours ago
      You don't need 4x redundancy for everything. If no humans are aboard, you have 2x redundancy and immediately reboot if there is a disagreement.
    • totetsu 7 hours ago
      They don't go into it here... but I thought NASA also used something like 250nm chips in space for radiation resistance. Are there even any radiation-resistant GPUs out there?
      • pclmulqdq 7 hours ago
        Absolutely not, although the latest fabs with rad-tolerant processors are at ~20 nm. There are FDSOI processes in that generation that I assume can be made radiation-tolerant.
      • kersplody 7 hours ago
        NOPE, RAD hardened space parts basically froze on mid 2000s tech: https://www.baesystems.com/en-us/product/radiation-hardened-...
      • linzhangrun 7 hours ago
        It seems not; radiation tolerance primarily relies on using older manufacturing processes (including for military equipment), and then applying shielding enclosures or hardware redundancy and correction similar to ECC.
  • spaceman123 4 hours ago
    Probably same way they’ve built fault-tolerant toilet.
    • jeron 1 hour ago
      ctrl+f toilet, thank you for already commenting this
  • nickpsecurity 5 hours ago
    The ARINC scheduler, RTOS, and redundancy have been used in safety-critical systems for decades; ARINC goes back to the '90s. Most safety-critical microkernels, like INTEGRITY-178B and LynxOS-178B, came with a layer for that.

    Their redundancy architecture is interesting. I'd be curious what innovations went into rad-hard fabrication, too. The Sandia Secure Processor (aka Score) was a neat example of a rad-hard, secure processor.

    Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.

  • SeanAnderson 4 hours ago
    Typo in the first sentence of the first paragraph is oddly comforting since AI wouldn't make such a typo, heh.

    Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.

    • tux 4 hours ago
      I think everyone should make mistakes now so we can distinguish human vs AI.
      • zeristor 32 minutes ago
        This can be optimised for, no doubt; adversarial training is like that.
  • temptemptemp111 1 hour ago
    [dead]
  • perarneng 2 hours ago
    [dead]
  • ConanRus 8 hours ago
    [dead]
  • hulitu 3 hours ago
    They run 2 Outlook instances. For redundancy. /s
  • seemaze 7 hours ago
    • adrian_b 6 hours ago
      That was a laptop, not one of the Artemis computers.
  • ajaystream 2 hours ago
    The fail-silent design is the part worth paying attention to. The conventional approach to redundancy is to compare outputs and vote — three systems, majority wins. What NASA did here instead is make each unit responsible for detecting its own faults and shutting up if it can't guarantee correctness. Then the system-level logic just picks the first healthy source from a priority list.

    That's a fundamentally different trust model. Voting systems assume every node will always produce output and the system needs to figure out which output is wrong. Fail-silent assumes nodes know when they're compromised and removes them from the decision set entirely. Way simpler consensus at the system level, but it pushes all the complexity into the self-checking pair.

    The interesting question someone raised — what if both CPUs in a pair get the same wrong answer — is the right one. Lockstep on the same die makes correlated faults more likely than independent failures. The FIT numbers are presumably still low enough to be acceptable, but it's the kind of thing that only matters until it does.
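    The two trust models can be contrasted in a few lines (my sketch, not flight software):

```python
# Contrast: majority voting (every node always answers, the system must
# out-argue the bad one) vs. fail-silent selection (bad nodes remove
# themselves, modeled as None). Hypothetical illustration only.
from collections import Counter

def majority_vote(outputs):
    """Classic voter: return the value held by a strict majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def first_healthy(outputs):
    """Fail-silent selector: first non-silent channel wins."""
    return next((o for o in outputs if o is not None), None)

print(majority_vote([7, 7, 9, 7]))    # 7: voter outvotes the bad channel
print(first_healthy([None, 7, 7, 7])) # 7: bad channel already silenced itself
```

    The fail-silent selector is trivially simple at the system level precisely because the hard part, deciding "am I wrong?", has been pushed down into each self-checking pair.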

    • adrian_b 1 hour ago
      This is similar to the difference between using error-correcting codes and using erasure codes combined with error-detecting codes.

      The latter choice is frequently simpler and more reliable for preventing data corruption. (An erasure code can be as simple as having multiple copies and using the first good copy.)

    • sammy2255 59 minutes ago
      Spoken like an LLM.
    • high_na_euv 21 minutes ago
      How can you remove a component from the decision set if it is the only component in the whole decision set?