How I write software with LLMs

(stavros.io)

119 points | by indigodaddy 6 hours ago

14 comments

  • prpl 0 minutes ago
    I am enjoying the RePPIT framework from Mihail Eric. I think it’s a better formalization of developing without resorting to personas.
  • akhrail1996 1 hour ago
    Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

    The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.

    • est 28 minutes ago
      > the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

      There's a 63-page paper with a mathematical proof, if you're really into this.

      https://arxiv.org/html/2601.03220v1

      My takeaway: AI learns from real-world texts, and real-world corpora tend to have an architect/developer/reviewer role split.

      • codeflo 0 minutes ago
        >> the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

        > There's a 63-page paper with a mathematical proof, if you're really into this.

        > https://arxiv.org/html/2601.03220v1

        I'm confused. The linked paper is not primarily a mathematics paper, and to the extent that it is, proves nothing remotely like the question that was asked.

    • totomz 30 minutes ago
      I think the splitting makes sense: it gives more specific prompts and isolated context to different agents. The "architect" does not need the code style guide in its context; that could actually be misleading and contain information that drives it away from the architecture.
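
      A minimal sketch of that isolation, assuming some generic `chat(system=..., user=...)` wrapper around your model of choice (the prompts here are invented for illustration):

          ARCHITECT_PROMPT = "You design module boundaries and data flow. Do not write code."
          DEVELOPER_PROMPT = "You implement the plan you are given. Follow docs/style.md."

          def run_role(chat, system_prompt, task):
              # Each role sees only what it needs: the architect never gets
              # the style guide, the developer never re-litigates the design.
              return chat(system=system_prompt, user=task)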
    • kybernetikos 13 minutes ago
      There's a lot of cargo culting, but that's inevitable in a situation like this, where the truth is model-dependent and changing the whole time, and people have built companies on the premise that they can teach you how to use AI well.
    • jaredklewis 48 minutes ago
      > what's the evidence

      What’s the evidence for anything software engineers use? Tests, type checkers, syntax highlighting, IDEs, code review, pair programming, and so on.

      In my experience, evidence for the efficacy of software engineering practices falls into two categories:

      - the intuitions of developers, based on their experience.

      - scientific studies, which are unconvincing. Some are unconvincing because they attempt to measure the productivity of working software engineers, which is difficult; you have to rely on qualitative measures like manager evaluations or quantitative but meaningless measures like LOC or tickets closed. Others are unconvincing because they instead measure the practice against some well defined task (like a coding puzzle) that is totally unlike actual software engineering.

      Evidence for this LLM pattern is the same. Some developers have an intuition it works better.

      • thesz 30 minutes ago
        You can measure customer facing defects.

        Also, lines of code is not a completely meaningless metric. What one should measure is lines of code that are not verified by the compiler. E.g., in C++ you cannot have unbalanced brackets or use an incorrectly typed value, but you may still have an off-by-one error.

        Given all that, you can measure customer-facing defect density and compare different tools, whether they are programming languages, IDEs or LLM-supported workflows.
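
        A toy version of that comparison (the numbers and the KLOC normalization are mine, purely illustrative):

            # Customer-facing defect density, normalized per 1000 lines of code.
            def defect_density(escaped_defects: int, loc: int) -> float:
                return escaped_defects / (loc / 1000)

            # Hypothetical comparison of two workflows over the same quarter:
            print(defect_density(12, 40_000))  # hand-written: 0.3 defects/KLOC
            print(defect_density(18, 90_000))  # LLM-assisted: 0.2 defects/KLOC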

      • codemog 31 minutes ago
        My friend, there’s tons of evidence of all that stuff you talked about in hundreds of papers on arxiv. But you dismiss it entirely in your second bullet point, so I’m not entirely sure what you expect.
      • jacquesm 20 minutes ago
        The proper metric is the defect escape rate.
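
        For reference, the usual way to compute it (a sketch; teams differ on what they count as a defect):

            def escape_rate(escaped: int, caught_internally: int) -> float:
                # Fraction of all known defects that made it past review/QA/tests.
                return escaped / (escaped + caught_internally)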
        • exidex 12 minutes ago
          Now you have to count defects
          • jacquesm 10 minutes ago
            You have to do that anyway, and in fact you probably were already doing that. If you do not track this then you are leaving a lot on the table.
    • palmotea 32 minutes ago
      > Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?

      Using multiple agents in different roles seems like it'd guard against one model/agent going off the rails with a hallucination or something.

    • jumploops 30 minutes ago
      After "fully vibecoding" (i.e. I don't read the code) a few projects, the important aspect of this isn't so much the different agents, but the development process.

      Ironically, it resembles waterfall much more than agile, in that you spec everything (tech stack, packages, open questions, etc.) up front and then pass that spec to an implementation stage. From there you either iterate or create a PR.
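
      For illustration, such an up-front spec might be a short markdown file along these lines (the sections are my own convention, not something the post prescribes):

          # spec.md

          ## Tech stack
          Python 3.12, FastAPI, Postgres (example choices)

          ## Packages
          Pinned dependencies, and why each one

          ## Open questions
          - [ ] Auth: sessions or JWT?
          - [ ] Background jobs: cron or a queue?

          ## Out of scope
          Anything deferred to a later iteration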

      Even with agile, it's similar, in that you have some high-level customer need, pass that to the dev team, and then pass their output to QA.

      What's the evidence? Admittedly anecdotal, as I'm not aware of any benchmarks that test this thoroughly, but in my experience this flow helps avoid the pitfall of slop that occurs when you let the agent run wild until it's "done."

      "Done" is often subjective, and you can absolutely reach a done state just with vanilla codex/claude code.

      Note: I don't use a hierarchy of agents, but my process follows a similar design/plan -> implement -> debug iteration flow.

    • troupo 1 hour ago
      > produces better results than just... talking to one strong model in one session?

      I think the author admits that it doesn't, doesn't realise it and just goes on:

      --- start quote ---

      On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet

      --- end quote ---

    • imiric 51 minutes ago
      Evidence? My friend, most of the practices in this field are promoted and adopted based on hand-waving, feelings, and anecdata from influencers.

      Maybe you should write and share your own article to counter this one.

      • z3t4 37 minutes ago
        Also, if something is fun, we prefer to do it that way instead of the boring way. Then it depends on how many mines you step on; after a while you try to avoid the mines. That's when your productivity goes down radically. If we see something shiny we'll happily run over the minefield again, though.
  • kleiba 7 minutes ago
    I write very little code these days, so I've been following the AI development mostly from the backseat. One aspect I haven't fully grasped is the practical difference between CLI (terminal-based) agents and ones fully integrated into an IDE.

    Could someone chime in and give their opinion on what are the pros and cons of either approach?

  • thenthenthen 1 hour ago
    Haha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.
  • christofosho 5 hours ago
    I like reading these types of breakdowns. They really give you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. A lot of context gets used when your agent needs to write across a large breadth of code areas (i.e. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read[1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.

    I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.

    Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA

    1. https://github.com/humanlayer/advanced-context-engineering-f...

    • marcus_holmes 4 hours ago
      That reference you give is pretty dated now; it's based on a talk from August, which is the Beforetimes, before the newer models that have given us such a step change in productivity.

      The key change I've found is really around orchestration - as TFA says, you don't run the prompt yourself. The orchestrator runs the whole thing. It gets you to talk to the architect/planner, then the output of that plan is sent to another agent, automatically. In his case he's using an architect, a developer, and some reviewers. I've been using a Superpowers-based [0] orchestration system, which runs a brainstorm, then a design plan, then an implementation plan, then some devs, then some reviewers, and loops back to the implementation plan to check progress and correctness.
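
      Stripped of any particular tool, the loop looks roughly like this (a hand-rolled sketch, not Superpowers' actual API; `run_agent` stands in for whatever starts a fresh session in your harness):

          def pipeline(run_agent, idea):
              design = run_agent("planner", f"Brainstorm, then write a design for: {idea}")
              plan = run_agent("planner", f"Turn this design into an implementation plan:\n{design}")
              while True:
                  diff = run_agent("developer", f"Implement the next unfinished step of:\n{plan}")
                  review = run_agent("reviewer", f"Review this change against the plan:\n{plan}\n{diff}")
                  if "APPROVED" in review:  # loop back until the reviewers are happy
                      return diff
                  plan = run_agent("planner", f"Update the plan given this review:\n{review}")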

      It's actually fun. I've been coding for 40+ years now, and I'm enjoying this :)

      [0] https://github.com/obra/superpowers

      • indigodaddy 3 hours ago
        Can you bolt superpowers onto an existing project so that it uses the approach going forward (I'm using Opencode), or would that get too messy?
        • eclipxe 1 hour ago
          Yes. But gsd is even better - especially gsd2
    • felixsells 2 hours ago
      re: breaking into specialized subagents -- yes, it matters significantly, but the splitting criterion isn't obvious at first.

      what we found: split on domain of side effects, not on task complexity. a "researcher" agent that only reads and a "writer" agent that only publishes can share context freely because only one of them has irreversible actions. mixing read + write in one agent makes restart-safety much harder to reason about.

      the other practical thing: separate agents with separate context windows helps a lot when you have parts of the graph that are genuinely parallel. a single large agent serializes work it could parallelize, and the latency compounds across the whole pipeline.
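
      a sketch of what that looks like, with invented tool names (the point is the read/write partition, not the harness):

          import asyncio

          RESEARCHER_TOOLS = {"read_file", "grep", "fetch_url"}     # no side effects
          WRITER_TOOLS     = {"write_file", "open_pr", "publish"}   # irreversible

          async def run_agent(task: str, tools: set[str]) -> str:
              # stand-in for your agent harness; returns the agent's output
              return f"[output of: {task}]"

          async def pipeline(research_tasks: list[str]) -> str:
              # read-only agents are restart-safe, so they can run in parallel
              notes = await asyncio.gather(*(run_agent(t, RESEARCHER_TOOLS) for t in research_tasks))
              # the single agent with irreversible actions runs last, serially
              return await run_agent("publish: " + "\n".join(notes), WRITER_TOOLS)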

  • silisili 2 hours ago
    I'm not sure the notion I keep seeing of "it's ok, we still architect, it just writes the code" (paraphrased) sits well with me.

    I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?

    • PAndreew 1 hour ago
      Others have already partially answered this, but here’s my 20 cents. Software development really is similar to architecture. The end result is an infrastructure of unique modules with different types of connectors (roads, grid, or APIs). Until now in SW dev, the grunt work was done mostly by the same people who did the planning, decided on the types of connectors, etc. Real estate architects also use a bunch of software tools to aid them, but there must be a human being at the end of the chain who understands human needs, understands - after years of studying and practicing - how the whole building and the infrastructure will behave at large, and who is ultimately responsible for the end result (and hopefully rewarded depending on the complexity and quality of the end result). So yes, we will not need as many SW engineers, but those who remain will work on complex, rewarding problems and will push the frontier further.
      • rurban 14 minutes ago
        Since I've worked as an architect, a few comments.

        Architecture is fine for big, complex projects. Having everything planned out up front keeps costs down and ensures the customer won't come back with late changes. But if costs are expected to be low and there's no customer, architecture is overkill. It's like making a movie without following the script line by line (watch Godard's Nouvelle Vague), or building it yourself or with a non-architect: 2x faster, 10x cheaper. And you can spot an inflexible, overarchitected project immediately.

        You can do fine by restricting the agent with proper docs, proper tests and linters.
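
        For example, a gate script like this (the tool choices are my assumption, not the parent's) gives the agent hard constraints without any up-front architecture:

            # check.py -- tell the agent: "run `python check.py` and don't stop
            # until it exits cleanly." Swap in whatever linters/tests you use.
            import subprocess, sys

            STEPS = [
                ["ruff", "check", "."],   # lint
                ["mypy", "src"],          # types
                ["pytest", "-q"],         # tests
            ]

            for cmd in STEPS:
                if subprocess.run(cmd).returncode != 0:
                    sys.exit(f"failed: {' '.join(cmd)}")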

    • chii 1 hour ago
      > Then what is our use?

      You will have to find new economic utility. That's the reality of technological progress - it's just that the tech and white-collar industries didn't think it could come for them!

      A skill that becomes obsolete is useless, obviously. There's still room for artisanal/handcrafted wares today, amidst industrial-scale production, so I would assume similar niches will remain for coding.

    • borski 2 hours ago
      LLMs can build anything. The real question is what is worth building, and how it’s delivered. That is what is still human. LLMs, by nature of not being human, cannot understand humans as well as other humans can. (See every attempt at using an LLM as a therapist)

      In short: LLMs will eventually be able to architect software. But it’s still just a tool

      • silisili 1 hour ago
        What is the use of a software eng/architect at that point? It's a tool, but one that product or C-levels can use directly, as I see it.
        • borski 1 hour ago
          Yes, for building something

          But for building the right thing? Doubtful.

          Most of a great engineer’s work isn’t writing code, but interrogating what people think their problems are, to find what the actual problems are.

          In short: problem solving, not writing code.

          • mattmanser 10 minutes ago
            Where's this delusion come from recently that great engineers didn't write code?

            What a load of crap.

            All you're doing is describing a different job role.

            What you're talking about is BA work, and a subset of engineers are great at it, but most are just ok.

            You're claiming a part of the job that was secondary, and not required, is now the whole job.

            • borski 3 minutes ago
              I never said great engineers didn’t write code. But writing the code was never the point.

              The point has always been delivering the product to the customer, in any industry. Code is rarely the deliverable.

              That’s my point.

        • 0xbadcafebee 1 hour ago
          A software engineer will be a person who inspects the AI's work, same as a building inspector today. A software architect will co-sign on someone's printed-up AI plans, same as a building architect today. Some will be in-house, some will do contract work, and some will be artists trying to create something special, same as today. The brute labor is automated away, and the creativity (and liability) is captured by humans.
      • roncesvalles 1 hour ago
        FWIW I find LLMs to be excellent therapists.

        The commercial solutions probably don't work because they don't use the best SOTA models and/or sully the context with all kinds of guardrails and role-playing nonsense, but if you just open a new chat window in your LLM of choice (set to the highest thinking paid-tier model), it gives you truly excellent therapist advice.

        In fact in many ways the LLM therapist is actually better than the human, because e.g. you can dump a huge, detailed rant in the chat and it will actually listen to (read) every word you said.

        • borski 1 hour ago
          Please, please, please don’t make this mistake. It is not a therapist. At best, it might be a facsimile of a life coach, but it does not have your best interests in mind.

          It is easy to convince and trivial to make obsequious.

          That is not what a therapist does. There’s a reason they spend thousands of hours in training; that is not an exaggeration.

          Humans are complex. An LLM cannot parse that level of complexity.

          • roncesvalles 51 minutes ago
            You seem to think therapists are only for those in dire straits. Yes, if you're at that point, definitely speak to a human. But there are many ordinary things for which "drop-in" therapist advice is also useful. For me: mild road rage, social anxiety, processing embarrassment from past events, etc.

            The tools and reframing that LLMs have given me (Gemini 3.0/3.1 Pro) have been extremely effective and have genuinely improved my life. These things don't even cross the threshold to be worth the effort to find and speak to an actual therapist.

            • defrost 44 minutes ago
              Which professional therapist does your Gemini 3.0/3.1 Pro model see?

              Do you think I could use an AI therapist to become a more effective and much improved serial killer?

            • borski 45 minutes ago
              I never said therapists were only for those in crisis; that is a misreading of my argument entirely.

              An LLM cannot parse the complexity of your situation. Period. It is literally incapable of doing that, because it does not have any idea what it is like to be human.

              Therapy is not an objective science; it is, in many ways, subjective, and the therapeutic relationship is by far the most important part.

              I am not saying LLMs are not useful for helping people parse their emotions or understand themselves better. But that is not therapy, in the same way that using an app built for CBT is not, in and of itself, therapy. It is one tool in a therapist’s toolbox, and will not be the right tool for all patients.

              That doesn’t mean it isn’t helpful.

              But an LLM is not a therapist. The fact that you can trivially convince it to believe things that are absolutely untrue is precisely why, for one simple example.

          • pzs 27 minutes ago
            While I agree with you, I also find that an LLM can help organize my thoughts and lead to realizations I just wouldn't have come to, because I hadn't explained verbally what I am thinking and feeling. It's definitely not a substitute for human interaction and relationships, which can be fulfilling in many, many ways LLMs are not, but LLMs can still be helpful as long as you exercise your critical thinking skills. My preference remains to talk to a friend, though.

            EDIT: seems like you made the same point in a child comment.

            • borski 23 minutes ago
              Yeah, I agree with all of that. A friend built an “emotion aware” coach, and it is extremely useful to both of us.

              But he still sees a therapist, regularly, because they are not the same and do not serve the same purpose. :)

  • jumploops 1 hour ago
    This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bit a few times, I have a few extra things in my process:

    The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.

    This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

    Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
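
    Concretely, the helper I'm describing amounts to something like this (filenames and layout are just my convention):

        from datetime import datetime
        from pathlib import Path

        def save_artifact(stage: str, content: str) -> Path:
            # e.g. docs/artifacts/2026-01-15T10-30_design.md
            ts = datetime.now().strftime("%Y-%m-%dT%H-%M")
            path = Path("docs/artifacts") / f"{ts}_{stage}.md"
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content)
            return path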

    A minor difference between my approach and the author's is that I don't rely on specific sub-agents (beyond what the agentic harness has built in for e.g. file exploration).

    I say it's minor, because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).

    One tip, if you have access, is to do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick-off a codex/claude code session. This can also be helpful for hard to reason about bugs, but I've only done that a handful of times at this point (i.e. funky dynamic SVG-based animation snafu).

    • lelele 1 hour ago
      > The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug/etc. into markdown files, for almost every step.

      > This is mostly helpful because it "anchors" decisions into timestamped files, rather than just loose back-and-forth specs in the context window.

      Would you please expand on this? Do you make the LLM append their responses to a Markdown file, prefixed by their timestamps, basically preserving the whole context in a file? Or do you make the LLM update some reference files in order to keep a "condensed" context? Thank you.

      • aix1 38 minutes ago
        Not the GP, but I currently use a hierarchy of artifacts: requirements doc -> design docs (overall and per-component) -> code+tests. All artifacts are version controlled.

        Each level in the hierarchy is empirically ~5X smaller than the level below. This, plus sharding the design docs by component, helps Claude navigate the project and make consistent decisions across sessions.
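
        As a sketch, the tree ends up looking roughly like this (file names are illustrative); the workflow below operates on these files:

            docs/
              requirements.md          # smallest layer: user-visible behaviour only
              design/
                overview.md            # system-wide decisions
                component-auth.md      # per-component design, incl. its test plan
                component-billing.md
            src/                       # code + tests: the largest layer, ~25X requirements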

        My workflow for adding a feature goes something like this:

        1. I iterate with Claude on updating the requirements doc to capture the desired final state of the system from the user's perspective.

        2. Once that's done, a different instance of Claude reads the requirements and the design docs and updates the latter to address all the requirements listed in the former. This is done interactively with me in the loop to guide and to resolve ambiguity.

        3. Once the technical design is agreed, Claude writes a test plan, usually almost entirely autonomously. The test plan is part of each design doc and is updated as the design evolves.

        3a. (Optionally) another Claude instance reviews the design for soundness, completeness, consistency with itself and with the requirements. I review the findings and tell it what to fix and what to ignore.

        4. Claude brings unit tests in line with what the test plan says, adding/updating/removing tests but not touching code under test.

        4a. (Optionally) the tests are reviewed by another instance of Claude for bugs and inconsistencies with the test plan or the style guide.

        5. Claude implements the feature.

        5a. (Optionally) another instance reviews the implementation.

        For complex changes, I'm quite disciplined about having each step carried out in a different session, so that all communications happen via checked-in artifacts and not through context. For simple changes, I often don't bother and/or skip the reviews.
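
        In practice, "a different session per step" can be as blunt as separate non-interactive CLI invocations, something like this (a sketch; `claude -p` is Claude Code's print mode, but substitute your harness's equivalent):

            import subprocess

            def step(prompt: str) -> None:
                # Fresh process per step: the only shared state is what's in the repo.
                subprocess.run(["claude", "-p", prompt], check=True)

            step("Read docs/requirements.md and update docs/design/ to cover it.")
            step("Bring the unit tests in line with the test plans in docs/design/.")
            step("Implement the feature; make the tests pass without editing them.")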

        From time to time, I run standalone garbage collection and consistency checks, where I get Claude to look for dead code, low-value tests, stale parts of the design, duplication, requirements-design-tests-code drift etc. I find it particularly valuable to look for opportunities to make things simpler or even just smaller (fewer tokens/less work to maintain).

        Occasionally, I find that I need to instruct Claude to write a benchmark and use it with a profiler to optimise something. I check these in but generally don't bother documenting them. In my case they tend to be one-off things and not part of some regression test suite. Maybe I should just abandon them & re-create them if they're ever needed again.

        I also have a (very short) coding style guide. It only includes things that Claude consistently gets wrong or does in ways that are not to my liking.

  • plastic041 2 hours ago
    I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works" while staying "intimately familiar with each project’s architecture and inner workings", despite having "never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.

    You tell an LLM to create something, and then use another LLM to review it. That might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.

    • ashwinsundar 2 hours ago
      Hot take: you can't have your cake and eat it too. If you aren't writing code, designing the system, creating architecture, or even writing the prompt, then you're not understanding shit. You're playing slots with stochastic parrots

          The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
      
      - Karpathy 2025
      • simonw 1 hour ago
        Your Karpathy quote there is out of context. It starts with: https://twitter.com/karpathy/status/1886192184808149383

          There's a new kind of coding I call "vibe
          coding", where you fully give in to the
          vibes, embrace exponentials, and forget
          that the code even exists.
        
        Not all AI-assisted programming is vibe coding. If you're paying attention to the code that's being produced, you can guide it towards being just as high quality as (or even higher quality than) code you would have written by hand.
        • ashwinsundar 1 hour ago
          It's appropriate for the commenter I was replying to, who asked how they can understand things, "while having never even read most of their code."

          I like AI-assisted programming, but if I fail to even read the code produced, then I might as well treat it like a no-code system. I can understand the high-levels of how no-code works, but as soon as it breaks, it might as well be a black box. And this only gets worse as the codebase spans into the tens of thousands of lines without me having read any of it.

          The (imperfect) analogy I'm working on is a baker who bakes cakes. A nearby grocery store starts making any cake they want, on demand, so the baker decides to quit baking cakes and buy them from the store. The baker calls the store anytime they want a new cake, and just tells them exactly what they want. How long can that baker call themself a "baker"? How long before they forget how to even bake a cake, and all they can do is get cakes from the grocer?

      • imiric 48 minutes ago
        > Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away.

        It's insane that this quote is coming from one of the leading figures in this field. And everyone's... OK that software development has been reduced to chance and brute force?

  • xhale 1 hour ago
    Hi, does anyone have a simple example/scaffold for how to set up agents/skills like this? I've looked at the stavrobots repo and only saw an AGENTS.md. Where do these skills live, then?

    (I have seen obra/superpowers mentioned in the comments, but that's already too complex and has a UI focus)

  • zapkyeskrill 11 minutes ago
    What's the point of writing this? In a few weeks a new model will come out and make your current work pattern obsolete (a process described in the post itself)
  • imiric 56 minutes ago
    Ah, another one of these. I'm eager to learn how a "social climber" talks to a chatbot. I'm sure it's full of novel insight, unlike thousands of other articles like this one.
  • biang15343100 1 hour ago
    [dead]
  • indigodaddy 4 hours ago
    This was on the front page and then got completely buried for some reason. Super weird.
    • mjmas 4 hours ago
      On the front page at the moment. Position 12
      • indigodaddy 3 hours ago
        Maybe I missed it. Sometimes when you're scanning for something your brain intentionally doesn't want to see it, I've noticed. Anyway I'm not Stavros obviously, just thought this was a good article.
    • stainlu 4 hours ago
      [flagged]
  • ForgotMyUUID 1 hour ago
    TL;DR: Don't, please :)