Agents need control flow, not more prompts

(bsuh.bearblog.dev)

590 points | by bsuh 34 days ago

146 comments

827a 34 days ago
1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".
We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: Letting the model manage the high level control flow.
This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4 IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5; not as extensively; but seems to have the same problems.
We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that test case, store results in an array, write results to file. This has made the system a billion times more reliable. But, it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism to them at the right place.
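A minimal sketch of the kind of harness described above, not the poster's actual code. The names `run_harness` and `ask_claude`, the prompt wording, and the JSON output format are all assumptions made for illustration; the `claude -p` call is borrowed from other comments in this thread. The point it shows is that the loop over files lives in ordinary code, and the model only ever sees one test case at a time.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable


def ask_claude(prompt: str) -> str:
    """Assumed model call: shells out to `claude -p` (swap in whatever you use)."""
    done = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return done.stdout.strip()


def run_harness(
    req_dir: Path, out_file: Path, ask: Callable[[str], str] = ask_claude
) -> list[dict]:
    """Test every requirements file exactly once, in a fixed order."""
    results = []
    # The loop is plain code, so no file is skipped or tested twice.
    for path in sorted(req_dir.glob("*.md")):
        prompt = (
            "Check whether the application meets these requirements:\n\n"
            + path.read_text()
        )
        try:
            results.append({"file": path.name, "ok": True, "output": ask(prompt)})
        except Exception as exc:
            # A failure is recorded for this file only; nothing gets re-tested.
            results.append({"file": path.name, "ok": False, "output": str(exc)})
    out_file.write_text(json.dumps(results, indent=2))
    return results
```

Passing the model call in as `ask` keeps the harness testable without a model, and makes it easy to point individual steps at a cheaper model later.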
[-]
- DrewADesign 34 days ago
  I used to assume they pushed people into the prompt-only workflows because you’re paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it. I do think it’s going to increase productivity enough to disastrously affect developer job market/pay scale, but I just don’t think this particular version of this particular technology is going to actually do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team (what most developers really want, and most executives really don’t), a bunch of investors would make them walk the plank.
  I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.
  [-]
  - cogman10 34 days ago
    > However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it.
    I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But, if you break it up into smaller and easier to consume chunks, all of a sudden you need a much less capable LLM to get results comparable to or better than the SOTA.
    Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
    [-]
    - devin 33 days ago
      And then you realize that what you’re using the smaller models for is ALSO decomposable and part of it is just a few if statements, and then you realize that for this feature you don’t actually need or want a model because the performance, reliability, reproducibility are cheaper and better for you and your users.
      [-]
      - jimbokun 33 days ago
        So you have the model write the if statements and put itself out of a job.
        [-]
        gman2093 33 days ago
        Alternatively, and sometimes more cost-efficient: you can find a developer who can write bespoke if statements. There are dozens of us!
        [-]
        saltcured 33 days ago
        So, are we going to end up with a mechanical Turk that pretends it is an LLM but just farms out tasks to gig workers?
        DrewADesign 33 days ago
        Additionally, developers tend to become less expensive as venture capitalists turn off the spigot, while access to giant frontier models becomes way more expensive. Beyond that, a developer might go out and have a beer with you after work, which appeals to the sickos that have the gall to prioritize humanity over fanatical efficiency for corporate gains.
    - aleqs 33 days ago
      Indeed, I've been experimenting with agent workflows, for complicated tasks - where I essentially have a graph of agents with different roles/capabilities, including such things as breaking down complex tasks into simpler ones. There seems to be a point where a complex enough task is better performed by a group of cheaper agents/models than by one agent using one of the SOTA big models, in terms of both quality and cost.
      [-]
      - zozbot234 33 days ago
        The big SOTA models win in world knowledge, that's what all those parameters are for. But a huge fraction of agentic tasks is going to be plain clerical work that needs no special knowledge at all, a much simpler model can do them in a straightforward way.
    - tempest_ 34 days ago
      It is also interesting because you get people with very different use cases arguing about the effectiveness of various models but doing very different things with them.
      It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and add a button connected to it on the front end vs "Design me a youtube". The smaller models can pretty dependably do the first and fall flat on the second.
  - zozbot234 33 days ago
    > However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects
    You can have the AI design the custom harness in advance. It's not especially hard work! In fact, the AI could even come up with the workflow itself; it's a different and much simpler problem than trying to stick to it after-the-fact, with a filled-in context.
    [-]
    - AndyNemmity 33 days ago
      That's what my system does. It uses a workflow if one already exists, if not, it just creates one on the fly from the primitives.
      https://github.com/notque/vexjoy-agent
      I would prefer that be deterministic though. This thread has me considering what if anything I can do to make it forced. Like, I could do it with hooks, but that's not elegant at all.
    - mf_kevintruong 33 days ago
      Yeah, that is what I am doing with the DAG-aware TUI hypervisor agents https://getspur.dev
  - user34283 34 days ago
    The designing and implementing of a code harness in your workflow can be as simple as running something like /skill-builder.
    You prompt for what you want it to do, and it will write e.g. Python scripts as needed for the looping part, and for example use claude -p for the LLM call.
    You can build this in 10 minutes.
    I don’t use a cloud platform, so I can’t comment on that part. I’d say just run it on your own hardware, it’s probably cheaper too.
  - pishpash 34 days ago
    Aren't they just buying time to build you whatever harness you need? They want to be the only software engineering shop in the world.
- fny 33 days ago
  Secret: "compile" that orchestration prompt. Determinism is solved by turning prompts into code that can in turn run agents or run code or both.
  Everyone misses this pattern with skills: you can just drop code alongside a SKILL.md to guarantee certain behaviors, but for some reason everyone's addicted to writing prompts. You don't even need to build a CLI. A simple skill.py with tasks does it. You can even have helpers that call `claude -p`!
  [-]
  - Cyan488 33 days ago
    What about when the model trusts itself more than the "black box" you gave it, and hallucinates its use or non-use in favor of reimplementation? I found this video about "intelligent disobedience" interesting.
    https://www.youtube.com/watch?v=Qu-00j9XuF0
  - AndyNemmity 33 days ago
    Yeah, that's how I do skills. If I can make a script, I do. Everything that can be deterministic should be. https://github.com/notque/vexjoy-agent
  - robinduckett 33 days ago
    Exactly this, I tend to work this way. I built an ingestion pipeline to pull concepts out of a novel using Qwen and push them into falkordb this way
  - krzyk 33 days ago
    Could you elaborate on what "compiling orchestration prompt" means?
    [-]
    - Frost1x 33 days ago
      When you get some abstraction working you concretize it in something deterministic, or sort of “cache” that knowledge bit (aka write me a function, class, library, whatever). In the future, the nondeterministic path now has a deterministic piece to lean on as it explores the problem space. Rinse, repeat, eventually you have a mostly deterministic system now. Leave flexibility in space where you need that nondeterminism.
    - LikesPwsh 33 days ago
      Rather than telling the LLM "loop through these files", tell it "write a script to loop through these files", then hard-code that script somewhere.
      [-]
      - whattheheckheck 33 days ago
        The models will eventually be able to know that they need to do that to get the thing done from natural language
        [-]
        suttontom 31 days ago
        "The models will eventually..." Yeah but they haven't, and it's been years now. Also who cares? We have problems right now that need to be solved.
      - renticulous 33 days ago
        First we gave LLMs access to bash commands. Now we give them access to customized commands which they can reuse. It's English language extending its claws into deterministic programming language. Now can we please have backtracking and dynamic programming like thinking loop built into English language or such orchestration prompts.
    - throawayonthe 33 days ago
      a guess but i think they mean take the orchestration prompt and prompt yet another llm to turn that prompt into code..?
- bob1029 34 days ago
  I saw a major uplift in performance after I combined tools like apply_patch with check_compilation & run_unit_tests. I still call the tool "apply_patch", but it now returns additional information about the build & tests if the patch succeeds. The agent went from ~80% success rate to what seems to be deterministic (so far). I don't bother to describe the compilation and unit testing processes in my prompts anymore. All I need to do is return the results of these things after something triggers them to run as a dependency.
  I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT5.4 base model.
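A rough sketch of the combined tool described above, under stated assumptions: the poster's real implementation lives in a WinForms app and is not shown in the thread, so `apply_patch_tool`, `ToolResult`, and the three step callables here are hypothetical names. It shows the idea that the agent calls one tool, and the build and test runs happen as dependencies whose results come back in the same response.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A step returns (succeeded, log text). How each step is run is up to the host app.
Step = Callable[[], Tuple[bool, str]]


@dataclass
class ToolResult:
    ok: bool
    report: str


def apply_patch_tool(apply: Step, build: Step, test: Step) -> ToolResult:
    """Run patch, build, and tests in order; stop at the first failure."""
    lines = []
    for name, step in (("patch", apply), ("build", build), ("tests", test)):
        ok, log = step()
        lines.append(f"[{name}] {'ok' if ok else 'FAILED'}: {log}")
        if not ok:
            # Later steps are skipped, so the agent sees only the first problem.
            return ToolResult(False, "\n".join(lines))
    return ToolResult(True, "\n".join(lines))
```

Because the report always includes build and test output, the prompt no longer has to explain when or how to compile and run tests.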
  [-]
  - modo_ 33 days ago
    I like this - I think you're not too far off of what's popular these days though. I think similar functionality can be achieved by using the "hook" functionality in claude code / codex.
  - bostonvaulter2 33 days ago
    Can you explain in more detail how you implemented those tools? Is that via a MCP server?
    [-]
    - bob1029 31 days ago
      > Is that via a MCP server?
      No, this is all in one application. A Winforms+WebView2 app wraps the chat completion APIs and implements the various tools directly.
- woeirua 34 days ago
  I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you’d expect.
- rdedev 34 days ago
  I had to create a hypothesis testing agent where it gets a query like "is manufacturing parameter x significantly different this month than last month" and have the agent follow a flowchart to run a statistical test and return the answer
  At the time I had access to only 4o and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
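The loop described above could look something like this. It is a sketch under assumptions, not the poster's code: `run_flowchart`, the `Flowchart` shape, and the example statistical-test chart are invented for illustration. Each node holds an instruction for the agent plus a plain function that picks the next node from the reply, so routing never depends on the model remembering the flowchart.

```python
from typing import Callable, Dict, List, Optional, Tuple

# node name -> (instruction for the agent, function mapping its reply to the next node)
Flowchart = Dict[str, Tuple[str, Callable[[str], Optional[str]]]]

# Hypothetical flowchart for choosing a statistical test.
EXAMPLE_CHART: Flowchart = {
    "check_normality": (
        "Are both samples normally distributed? Answer yes or no.",
        lambda reply: "t_test" if reply.strip().lower() == "yes" else "mann_whitney",
    ),
    "t_test": ("Run a two-sample t-test and report the p-value.", lambda reply: None),
    "mann_whitney": ("Run a Mann-Whitney U test and report the p-value.", lambda reply: None),
}


def run_flowchart(
    chart: Flowchart,
    start: str,
    agent: Callable[[str], str],
    max_steps: int = 20,
) -> List[Tuple[str, str]]:
    """Walk the flowchart in code, giving the agent only the current step."""
    trace: List[Tuple[str, str]] = []
    node: Optional[str] = start
    while node is not None:
        if len(trace) >= max_steps:
            raise RuntimeError("flowchart did not terminate")
        instruction, choose_next = chart[node]
        reply = agent(instruction)
        trace.append((node, reply))
        node = choose_next(reply)  # routing is decided by code, not by the model
    return trace
```

The returned trace doubles as an audit log of which branch was taken and why.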
- cheshire_cat 34 days ago
  Wouldn't it be more efficient to convert the requirements in these 200 markdown files into Playwright tests?
  You could still use an LLM to write and extend the tests, but running the tests would be deterministic and would use less resources.
  [-]
  - tharkun__ 33 days ago
    This type of thing so much.
    AI is being pushed so much at work right now. For non-dev stuff even. The amount of things that people think are "awesome never seen this" is staggering.
    Just because you haven't seen file format X converted to file format Y before and now you asked the LLM to do it and it worked, doesn't mean you needed an LLM for it nor that it's remarkable. The LLM knew how to do it because it learned from a bazillion online sources for deterministic converters that cost nothing (and have open source). But now you're paying, every single time, for a non-deterministic version of it and you find it cool. It's magic ...
    But I guess they deserve it.
    [-]
    - gofreddygo 33 days ago
      > It's magic
      you'll be surprised with how many people are comfortable attributing something they do not understand to Magic.
      more than anything, ai let people who couldn't and wouldn't bother to learn to write simple code, to side step ones who can and build solutions to scratch their own itch. that too faster.
      now human behavior kicks in, and they don't want to hand control back into the hands of people who can code to solve problems.
      put this together and you have a good model to understand the AI sales pitch... It's magic
      like all magic, it's but a trick.
      [-]
      - dkersten 33 days ago
        Oh, yes! As someone who has dabbled in card tricks, this so much. People don't understand how it's done and can't imagine or conceive of a way that it possibly could be done, so they attribute it to literal magic or demons or whatever. Like, no, I just distracted you for a split second and used sleight of hand.
        Technology is no different: someone has never even considered that this thing could be possible, and now they see it with their own eyes? Incredible! They don't realise that it's mundane and has been possible (in much cheaper ways) for a long time. It was like a few years ago when some journalist posted an animation showing how Horizon Zero Dawn does frustum culling and all the non-tech people were all "wow! This game unloads the game world when it's not in view! Incredible!", like... yeah? That's how games have worked since the advent of 3D?
- julianlam 34 days ago
  > This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason. It was very frustrating.
  Sorry, you thought a prompt was a suitable replacement for a testing suite?
  [-]
  - zapataband1 34 days ago
    hey man it works great barely and also costs a bunch of money every time we run it. we also can't trust the results, relax.
  - deadbabe 33 days ago
    If you are invested in AI stocks, this is the way. You are basically funneling money from software companies into your brokerage account. Keep going.
- throawayonthe 33 days ago
  > But, it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc)
  couldn't you "just" have it orchestrate a bunch of subagents? a la the superpower skill
  definitely a worse solution, non deterministic orchestration + way higher token usage (unless there's a way to hide the subagent output from the orchestrator agent? i haven't used any of these platforms) but could work in some cases
- mmis1000 34 days ago
  > This started breaking down after ~30 files.
  Codex's short context and todo list system combined somehow help here, though, because of the frequent compaction. The model is forced to recheck which todo list items have not been done yet and which workflow skill it has to use. I used to leave it for multiple hours to do a big cleanup and it finished without obvious issues.
  [-]
  - swores 34 days ago
    Is Codex willing to do "multi hour" tasks when used with a ChatGPT Plus subscription, or does it need something more expensive like Pro?
    [-]
    - dnh44 34 days ago
      I regularly get codex to do multi hour tasks with a single prompt. I don't think that's a big deal anymore. But you don't want a single agent doing all the work. The root agent needs to delegate the work to sub agents. For example, a sub agent for context gathering, then one for planning, then one (or more) for implementation, then another for review. This way the root agent doesn't use up its context window and it just manages from a bird's eye view. I do have the $200 plan though.
    - dns_snek 33 days ago
      It's going to work the same regardless of how much you pay, but with Plus you'll run into 5h usage limit rather quickly unless your "multi hour task" spends 90% of the time just waiting around for code to compile. Expect to get an hour or two of active work (single-threaded).
    - shivnathtathe 33 days ago
      If you have any org email, you can get free chatgpt + subscription.
- data-ottawa 33 days ago
  Google ADK might be useful, especially since v2 reorients it around graph operators for control flow.
  Your specific case is listed in the v2 docs with an operation that fans out to many parallel tasks and then joins the results.
- otikik 33 days ago
  I never tell claude to "go over this bunch of files and do this".
  I tell it "write a program that goes over this bunch of files and do this".
  Sometimes "do this" can be invoking another claude instance.
- sroussey 34 days ago
  I’m working on a hybrid system of old school task graph and ai agents and let them instantiate each other. I think others will do that eventually.
  [-]
  - tonylucas 34 days ago
    I'm working on something similar (won't link to it as don't want people to think I'm spamming) but if you want to compare notes happy to talk.
  - cluckindan 34 days ago
    Jira for agents?
    [-]
    - werrett 33 days ago
      c.f. Linear for Agents
      https://linear.app/agents
- crsn 34 days ago
  Our team at Agentforce recently open-sourced our solution to this and we've gotten very valuable feedback -- would love to hear from more of you about it: https://github.com/salesforce/agentscript
  [-]
  - zapataband1 34 days ago
    No you didn't
    "What we're not open sourcing (yet) is the runtime. "
    [-]
    - tadfisher 33 days ago
      If it actually takes off, expect a vibecoded runtime that everyone runs on their own systems.
- awongh 34 days ago
  The other part of the question is exactly when the "build for the capabilities of future models" becomes the present.
  Looking at the Mythos benchmarks, it doesn't seem like the models are that close to being truly reliable for agentic tasks.
  Is it a year away, or five? That's a big difference in deciding what to build today.
- Joeri 34 days ago
  You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to “run everything”, it just needs to know how to launch the right script.
  [-]
  - AndyNemmity 33 days ago
    For sure, this is the pattern I use.
    And I wish I could make it even more deterministic. Maybe I can, but it can also be a bit challenging to sort.
- krashidov 33 days ago
  > We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: Letting the model manage the high level control flow.
  This is cool. Can you elaborate on it? Is it flaky? Does it take a long time?
- imtringued 33 days ago
  I'm personally surprised by this too. Like, everyone is writing how insanely productive AI is making them, but that productivity doesn't seem to have translated into any innovations beyond model quality.
  Like, most of the stuff needed to make AI better is stuff that could have been written by hand in 2015, so why hasn't anyone used their agents to do so?
  To be fair, there is probably a way to make it work the way you want. You could add an MCP for a task queue and let the model work each item in the task queue. The tasks could be added by a deterministic system i.e. your harness.
- jiehong 33 days ago
  This might be inherent to how the models are benchmarked.
  Aren’t some benchmarks giving the model multiple shots at a problem and only keeping the successful result if it appeared, ignoring the failure rate?
  [-]
  - andyferris 33 days ago
    Good point. We need the mean, “any 1 of 10” and the “all 10 of 10” success rates in the metrics, so we can estimate reliability (the last one).
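As a small illustration of the three numbers proposed above, here is a sketch that assumes each problem was attempted the same fixed number of times (10 in the comment) and that each attempt is recorded as pass/fail. The function name `reliability_metrics` and the input layout are assumptions, not any benchmark's actual format.

```python
from typing import Dict, List


def reliability_metrics(attempts: List[List[bool]]) -> Dict[str, float]:
    """attempts[i][j] is True if attempt j on problem i succeeded."""
    n = len(attempts)
    return {
        # Average per-attempt success rate across problems.
        "mean": sum(sum(a) / len(a) for a in attempts) / n,
        # Fraction of problems solved at least once (what multi-shot scoring rewards).
        "any": sum(any(a) for a in attempts) / n,
        # Fraction of problems solved on every attempt (the reliability estimate).
        "all": sum(all(a) for a in attempts) / n,
    }
```

A large gap between "any" and "all" is the signal that a model can do a task but cannot be trusted to do it every time.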
- sharperguy 34 days ago
  So I wonder if a more powerful agent harness could have the agent basically write and execute its own deterministic code, which when executed, spawns sub agents for each of the subtasks?
  So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
  [-]
  - tonylucas 34 days ago
    I've been working on an integrated deterministic/agent system for a few months now. It basically runs an AI step to build a plan, which biases towards deterministic steps as much as possible but escalates back to AI when it needs to (for AI only capabilities or deterministic failures) so effectively (when I perfect it, I'm about 90% there) it can bounce back and forth as needed with deterministic steps launching AI steps and AI steps launching deterministic steps as needed.
    Probably not explaining it very well but I think it's pretty effective at reducing token usage.
    [-]
    - dkersten 33 days ago
      I've been building a workflow engine for agent orchestration and the workflows are just data for the engine to execute. While I haven't experimented with it yet, I envision that an LLM would be rather good at generating the workflows based on a description of your needs (and context about how best to utilise the workflow engine).
      LLMs are pretty good at reasoning about workflows, it's just that when they have to apply them directly, the workflow context gets muddled with your actual task's context. That's why using an orchestration agent that delegates work to worker agents works so much better.
      I still think there's a huge amount of value in having the workflow executed in a deterministic way (as code, or by a workflow engine) because it saves tokens, eliminates any possibility of not following it, and unlocks other cool things, like being able to give each step in the workflow its own focused task-specific context, splitting plans into individual actions and feeding them through a workflow one by one, and having workflow-step specific verification.
      But that workflow absolutely CAN be created by an LLM, it just shouldn't be executed by one.
    - shripadt 34 days ago
      [flagged]
  - peyton 34 days ago
    I make codex do everything through a giant `justfile`. Simple, greppable, self-documenting, works great, and I don’t even need to read it.
- pishpash 34 days ago
  Can you not have it write your harness for you, or have it be the first step? You can push your own determinism where you need, surely.
  [-]
  - svachalek 34 days ago
    True. The prompt reads: Run the following Python: ```
- stpedgwdgfhgdd 33 days ago
  I'm running into similar issues, more and more I’m moving complexity from the agent into the (Go) logic in order to make it more deterministic.
  To be more precise: everything is prepared in the form of files instead of letting the subagents make API/CLI calls. And still, sometimes (even with enough context) the main agent takes strange turns.
- andy12_ 33 days ago
  Isn't this already possible to implement with skills and subagents? Like have a skill saying "to test these files run this script that executes a subagent for every markdown file, then check the results".
- rmaxdev 33 days ago
  The agent can do this one by one in an agentic loop, storing the progress and backlog in files
  If nothing is stored in durable memory then the context window is going to get rotten
- nvarsj 33 days ago
  I almost always use orchestration tooling nowadays. cc itself feels too basic, even with things like superpowers.
- vishna 33 days ago
  This was a great example, thanks
- zapataband1 34 days ago
  [flagged]
  [-]
  - BalinKing 33 days ago
    From the site guidelines (https://news.ycombinator.com/newsguidelines.html):
    > Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
rnxrx 34 days ago
I wonder if a part of the problem isn't just the misapplication of LLMs in the first place. As has been mentioned elsewhere, perhaps the agent's prompt should be to write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible. This would hopefully include validation of the agent's output as well. The overall goal would be to keep the LLM out of doing processing that could be more efficiently (and often correctly) handled programmatically.
[-]
- chrismarlow9 34 days ago
  100% agreed. use the non-deterministic thing that is right 90% of the time to generate a deterministic thing that is right 100% of the time. one of the key things I add to my prompts is:
  - Please consult me when you encounter any ambiguous edge cases
  Attaching the AI to production to directly do things with API calls is bad. For me the only use case where the app should do any AI stuff is with reading/categorizing/etc. Basically replacing the "R" in old CRUD apps. If you want to use that same new AI based "R" endpoint to auto fill forms for the "C", "U", and "D" based on a prompt that's cool, but it should never mutate anything for a customer before a human reviews it. Basically CRUD apps are still CRUD apps (and this will always be true), they just have the benefit of having a very intelligent "R" endpoint that can auto complete forms for customers (or your internal tooling/Jenkins pipelines/etc), or suggest (but never invoke) an action.
  [-]
  - TZubiri 34 days ago
    > Please consult me when you encounter any ambiguous edge cases
    Why not check the logprobs of the output and take action when the probabilities of the first and second most likely tokens are too similar (or below a certain threshold)?
    [-]
    - SpicyLemonZest 33 days ago
      I think you're getting abstraction layers mixed up, prediction uncertainty and logical uncertainty aren't the same. In a reasoning model, it's entirely possible that there's only one likely continuation and it says something like "This edge case is ambiguous, but what the user most likely meant is X".
    - jatora 34 days ago
      because this is manual? are you an llm?
- vishvananda 34 days ago
  I think there is a flow in most organizations from:
  llm -> prompt -> result
  llm -> prompt + prompt encoded as skill -> result
  llm -> prompt + deterministic code encoded as skill -> result
  I do think prompting to generate code early can shortcut that path to deterministic code, but we're still essentially embedding deterministic code in a non-deterministic wrapper. There is a missing layer of determinism in many cases that actually makes long-horizon tasks successful. We need deterministic code outside the non-deterministic boundary via an agentic loop or framework. This puts us in a place where the non-deterministic decision making is sandwiched in between layers of determinism:
  deterministic agentic flows -> non-deterministic decision making -> deterministic tools
  This has been a very powerful pattern in my experiments and it gets even stronger when the agents are building their own determinism via tools like auto-researcher.
  [-]
  - Charles389no 20 days ago
    [flagged]
- evilelectron 34 days ago
  This is exactly how I did my last project of automating the generation of an interface library between a server that controls hardware and the mobile app.
  The hardware control team delivers a spec as a document and spreadsheet. The mobile team was using that to code the interface library and validating their code against the server. I converted the document to TSV, sent some parts to Claude and had it write a parser for the TSV, keeping all the nuances of the human-written spec. It took more than 150 iterations to get the parser to handle all edge cases and generate an intermediate output as JSON. Then Claude helped me write a code generator using some custom glue on top of Apollo to generate the code that is consumed by the mobile app.
  This whole pipeline runs as part of Github actions and calls Claude only when our library validator fails. There is an md file which is sent to Claude on failure as part of the request to figure out what went wrong, propose a solution and create a PR. This is followed by a human review, rework and merge. Total credits consumed to get here < $350.
- VMG 34 days ago
  The problem is that often the program runs into some edge case that requires interpretation, at which point one is tempted to let the LLM deal with the edge case, at which point one is tempted to let the LLM deal with the whole loop and let it do the tool calls
  [-]
  - Fishkins 34 days ago
    Agreed. I think the approach described here is promising. Most of the workflow is deterministic and includes safeguards, but an LLM is invoked in the one case where it's really useful.
    https://lethain.com/agents-as-scaffolding/
- khasan222 34 days ago
  Completely agree! People tend to forget we are non deterministic too! Yet we are able to write code fine, and fairly reliably by using tools that can help keep us fairly honest.
  I think most problems with ai tend to be around can you deterministically test the thing you are asking it to do?
  How many of us would never ever show work, without going to check the thing we just built first?
  [-]
  - cluckindan 34 days ago
    > can you deterministically test the thing you are asking it to do?
    Of course: have it write tests first; and run them to check its work.
    Works well for refactoring, but greenfield implementations still rely on a spec that is guaranteed to be incomplete, overcomplete and wrong in many ways.
    [-]
    - pishpash 34 days ago
      You can't ask something to check its own work without external reward/penalty. It'll cheat.
      [-]
      - khasan222 34 days ago
        Weirdly, and I fully think this is just some cognitive bias I don't have the knowledge to name, the AI seems very happy to please me. Like when it gets something done in one shot, it seems very happy to do so.
        [-]
        daveguy 33 days ago
        It's because expressing emotion tests well in RLHF (reinforcement learning, human feedback), which is the layer on top of the next-token-predictor LLM. As a bonus, it helps manipulate operator reactions to incorrect output, and improve engagement (aka token use).
        The "thought process" of an LLM only exists as inference response to next token prediction prompts. It's the illusion of emotion.
    - khasan222 34 days ago
      Well, if the spec is incomplete it sounds like you should lower the scope for the AI, and then go from there. I wouldn't be too keen to give a junior engineer free rein and expect awesomeness.
- nixpulvis 34 days ago
  My agents often write themselves scripts. Isn't that effectively what you're asking for? Prompting for scripts can also be a useful time and accuracy tactic when you know it'll be a good fit for it.
  [-]
  - falkensmaize 34 days ago
    The problem is that code it spits out on the fly is untested and untrustworthy. Identify the parts of your workflow that could be accomplished with regular code - write and unit test that code, with LLM help if you want, and use the llm as the orchestrator only.
  - sisve 34 days ago
    Yeah, the problem is that I do not think the agents are good at reusing scripts and stitching them together. At least for me, they keep recreating too many similar scripts. I hope we will see platforms like windmill.dev find the optimal solution for this. I have not been able to test it enough, but a platform that gives you some observability out of the box and protects secrets from the LLM is nice.
    [-]
    - reddit_clone 34 days ago
      I noticed that too. Unless you _ask_ for a script, they throw away the scripts they write.
      They are particularly bad at complex multiline parsing. Writing all sorts of weird/crude python/awk scripts and getting confused in the process.
      I wish they would use Perl6/Grammar or Haskell/Parsec or similar and write better parsing scripts.
      [-]
      - quinnjh 33 days ago
        For the non haskell folks like myself, what would that look like/ why is parsing better? Perl i get
        [-]
        reddit_clone 29 days ago
        Perl has powerful regular expressions, but it only goes so far. Doing multiline/nested structured parsing is too painful.
        Perl6/Raku has built-in grammars that can do that idiomatically.
        If you have a couple minutes, give this a glance. It will give you an idea.
        https://andrewshitov.com/2018/10/31/a-simple-parser-in-perl-...
        I am no expert in haskell either. But parsec is similar in concept.
- memjay 33 days ago
  This has been our experience as well. Initially we had a list of tools that the agent could use to manipulate a data structure in certain ways. This approach was quite brittle. Now we are using a small DSL (domain-specific language) and a single tool where the agent can input scripts written in the DSL. We are getting more dynamic use cases now, and wrong syntax can easily be caught by the parser and relayed to the agent.
  [-]
  - HatchedLake721 33 days ago
    Do you have an example of type of data and DSL? I feel I’d just give it access to write python/js to manipulate data
    [-]
    - memjay 32 days ago
      We decided not to go with Python/JS to make executing safe and simple.
      The data structure is a recursive list of simple objects that form a table of content.
      DSL uses Python syntax though. For example:
      swap_section(a, b)
      create_section(after=2)
      delete_section(2)
      This proved to save a ton of explanatory prompts that would be needed if every command was a tool instead. And it’s faster and more reliable.
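One way a Python-syntax DSL like this could be parsed without executing arbitrary code is to lean on the standard `ast` module and accept only whitelisted calls. This is a sketch under assumptions, not memjay's implementation: the `ALLOWED` set and `parse_script` are hypothetical, and arguments are assumed to be literals (so `swap_section(1, 3)` rather than named sections).

```python
import ast

# Hypothetical command whitelist, taken from the commands named above.
ALLOWED = {"swap_section", "create_section", "delete_section"}


def parse_script(script: str) -> list[tuple[str, list, dict]]:
    """Parse a DSL script into (command, args, kwargs) tuples.

    Nothing is executed here. Any problem is raised as a ValueError whose
    message is short enough to relay back to the agent.
    """
    try:
        tree = ast.parse(script)
    except SyntaxError as exc:
        raise ValueError(f"syntax error on line {exc.lineno}: {exc.msg}") from exc

    commands = []
    for stmt in tree.body:
        call = stmt.value if isinstance(stmt, ast.Expr) else None
        if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
            raise ValueError(f"line {stmt.lineno}: only plain command calls are allowed")
        if call.func.id not in ALLOWED:
            raise ValueError(f"line {stmt.lineno}: unknown command {call.func.id!r}")
        try:
            # literal_eval accepts AST nodes and rejects anything non-literal.
            args = [ast.literal_eval(a) for a in call.args]
            kwargs = {k.arg: ast.literal_eval(k.value) for k in call.keywords}
        except (ValueError, TypeError) as exc:
            raise ValueError(f"line {stmt.lineno}: arguments must be literals") from exc
        commands.append((call.func.id, args, kwargs))
    return commands


commands = parse_script("swap_section(1, 3)\ncreate_section(after=2)\ndelete_section(2)")
print(commands)
```

Because imports, assignments, and attribute access never reach an interpreter, a script such as `import os` is rejected at parse time with a message the agent can act on.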
- user3939382 34 days ago
  > write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible
  Correct. The concept of having probabilistic output with deterministic acceptance “guardrails” is illogical. If the domain resists deterministic modeling such that you’re using an LLM, the guardrails don’t magically gain that capability.
- groovetandon 34 days ago
  This is so true. I have been working on a project built on exactly this principle -
  https://www.decisional.com/blog/workflow-automation-should-b...
  I think there is a fundamental incentive problem - code + llm + harness is bound to be more efficient but the labs want you to burn tokens so they are not going to tell you to use the code, just burn more tokens. They are asking us to forget about the token cost and reliability for now - model will become better.
  This means that most people just believe that their agent should just be able to do anything with the help of some Model fairy dust with prompts + skills.
  People need to watch their agents fail in production to be able to come to the right conclusion unfortunately.
  [-]
  - user34283 33 days ago
    Skills are not fairy dust but a combination of prompts and deterministic code, so that you get the best of both worlds.
    E.g., loop in the code and process the subagent's non-deterministic response for each individual task.
    This takes 10 minutes to set up, you just need to run something like /skill-builder and describe the desired workflow.
    I imagine many people just don’t know that it’s possible. I only discovered it a few days ago myself.
    It worked on the first try.
- foolserrandboy 34 days ago
  yup, the standard way of thinking about agents seems backwards and probably costly. Use LLMs to write scripts, then stick all your scripts in your own looping harness and call out for LLMs for those parts that are too hard to automate with some deterministic validation at the end.
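A minimal sketch of such a looping harness, assuming the LLM call and the validation step are supplied as ordinary functions. `run_harness`, `flaky_agent`, and `looks_valid` are hypothetical names for illustration; the stub agent stands in for a real model call.

```python
from typing import Callable, Optional


def run_harness(
    cases: list[str],
    run_agent: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> dict[str, dict]:
    """Iterate over cases in plain code; the model only handles one case at a time."""
    results: dict[str, dict] = {}
    for case in cases:
        output: Optional[str] = None
        for attempt in range(1, max_attempts + 1):
            output = run_agent(case)
            if validate(case, output):  # deterministic check at the end
                results[case] = {"status": "ok", "output": output, "attempts": attempt}
                break
        else:
            # No attempt passed validation; record the failure and move on.
            results[case] = {"status": "failed", "output": output, "attempts": max_attempts}
    return results


# Stub agent for illustration: returns an empty answer on its second call.
calls = {"n": 0}


def flaky_agent(case: str) -> str:
    calls["n"] += 1
    return "" if calls["n"] == 2 else f"checked {case}"


def looks_valid(case: str, output: str) -> bool:
    return output.startswith("checked")


results = run_harness(["a.md", "b.md", "c.md"], flaky_agent, looks_valid)
print({case: r["attempts"] for case, r in results.items()})
```

Every case is visited exactly once by the loop, so a skipped or triple-tested file can only come from a bug in the harness, which is ordinary testable code.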
- marcus_holmes 34 days ago
  We have a rule that the LLM cannot perform any actions that result in actual money or stuff moving. Those can only be done by API calls that have lots of validation and checks on them, and adding or changing an API call is gated behind human review. The LLM is then free to make as many API calls as it likes, we're confident that it can't screw anything up too badly.
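A sketch of what a validated API gate of this kind might look like. The refund scenario, `issue_refund`, `RefundRequest`, and the `MAX_REFUND_CENTS` cap are all hypothetical and are not taken from the comment; the idea is only that the checks live in reviewed code rather than in a prompt.

```python
from dataclasses import dataclass


class RejectedAction(Exception):
    """Raised when a requested action fails a deterministic check."""


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount_cents: int


MAX_REFUND_CENTS = 50_000  # hypothetical per-call cap


def issue_refund(req: RefundRequest, order_totals: dict[str, int]) -> dict:
    """Validate a refund request before anything of value moves."""
    if req.order_id not in order_totals:
        raise RejectedAction(f"unknown order {req.order_id!r}")
    limit = min(MAX_REFUND_CENTS, order_totals[req.order_id])
    if not 0 < req.amount_cents <= limit:
        raise RejectedAction(f"amount must be between 1 and {limit} cents")
    return {"order_id": req.order_id, "refunded_cents": req.amount_cents}


orders = {"A1": 2_000}
receipt = issue_refund(RefundRequest("A1", 500), orders)
print(receipt)
```

The agent can call `issue_refund` as often as it likes; a request for more than the order total is rejected with an error it can read, no matter how the prompt was phrased.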
bwestergard 34 days ago
I agree with the sentiment, but I think the conclusion should be altered. When you hit the limit of prompting, you need to move from using LLMs at run time to accomplish a task to using LLMs to write software to accomplish the task. The role of LLMs at run time will generally shrink to helping users choose compliant inputs to a software system that embodies hard business rules.
[-]
- scrappyjoe 34 days ago
  I’ve had a couple of weeks of downtime at work, so I decided to incorporate agents into my work processes - things like note taking, task tracking, document management.
  Your comment EXACTLY mirrors my experience. Week 1 was ever expanding prompts, and degrading performance. Week 2 has been all about actually defining the objects precisely (notes, tasks, projects, people etc) and defining methods for performing well defined operations against these objects. The agent surface has, as you rightly point out, shrunk to a translation layer that converts natural language to commands and args that pass the input validator.
- sowbug 34 days ago
  A full-circle system prompt would be to "find every opportunity to put yourself out of your job by automating it away. When you are given a question that code can answer, answer the question by writing code and running it to obtain the result."
  Such an LLM might have fared better with the strawberry test.
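For context, the "strawberry test" asks how many times the letter r appears in "strawberry". Delegated to code, the question becomes a one-liner; `count_letter` below is just an illustrative helper name.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


answer = count_letter("strawberry", "r")
print(answer)  # 3
```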
  [-]
  - Imanari 33 days ago
    That’s exactly the approach of smolagents. The only “tool” available is writing Python code.
- edgarvaldes 34 days ago
  Some have expressed the opinion in this forum that the future of software lies in programs that are created and adapted at runtime, using genAI. I don't know how far we are from that.
  [-]
  - aleksiy123 34 days ago
    It’s already here the question is just to what extent?
    Are google search results modifying your software at runtime?
    Take our agent chat for example: the output text is a UI, and agents can generate charts and even constrained UI elements.
    Isn’t that created and adapted at run time?
    If you mean like agents live modifying your code. I think that’s pretty much here as well. Can read the logs and send prs.
    The only thing is how fast that loop will execute from days or hours to mins or seconds, and what validation gates it needs to pass.
    My git repo is pretty much self modifying personal software at this point, that I interface through the ide chat window.
    But I don’t think we will ever lose the intermediary deterministic language (code) between the llm and the execution engine.
    It would be prohibitively expensive to run everything through models all the time.
    But I am starting to think we need a more precise language than English when talking with LLMs. That can do both precision and ambiguity when you need either.
    [-]
    - jmaw 34 days ago
      Some kind of "code", you could say
      [-]
      - aleksiy123 34 days ago
        Yes but more declarative vs imperative.
        I say what, the LLM says how.
        [-]
        pishpash 34 days ago
        Not that long ago the workflow was to turn code comments into code. Maybe leave some comments as is now.
    - pishpash 34 days ago
      Sounds like assemblers bemoaning loss of control to C. The solution was inline assembly...
  - mjr00 34 days ago
    > Some have expressed the opinion in this forum that the future of software lies in programs that are created and adapted at runtime, using genAI.
    Good luck with that. Users will flood you with complaints if a button moves 5px to the left after a design update. A program that is generated at runtime, with not just a variable UI but also UX and workflows, would get you death threats.
    [-]
    - hilariously 34 days ago
      I think many software adjacent folks are super excited because they can now have the personalized toothbrush they keep asking people to make for them.
      The problem is that outside of that most people want boring and regular interfaces so they can get in and solve the problem and get out - they don't want to "love" it or care if its "sexy" they want it to work and get out of the way.
      LLMs transmogrifying your software at every request assumes people are software architects and creators who love the computer interface, and that just doesn't describe the bulk of the population.
      Most people using computers use them to consume things or to access things, not for their own sake, and they certainly don't think "what if I just had code to do x..." unless x is making them a lot of money.
    - munk-a 34 days ago
      A program that is generated at runtime is fine (we have interpreted languages and often compile on demand) - the issue is with the non-deterministic nature of the output.
      I think the core issue is that non-deterministic output is great for a chatbot experience where you want unpredictable randomness so it feels less like talking to the mirror - but when it comes to coding I think we're pretty fundamentally misaligned in sticking to that non-deterministic approach so firmly.
  - cassianoleal 34 days ago
    So we're back to vim over ssh in production, only without a human with _some semblance_ of judgement in the loop?
- QuercusMax 34 days ago
  I've seen cases where models will get stuck in a particular mode of problem solving and need a nudge to tell them to move to a new mode. For example, instead of trying to massage a bunch of system service configs to handle hot-plug/unplug of an audio stream, what I really needed was to just write a couple dozen lines of Python to handle stuff.
  I just had Claude write itself a couple shell scripts to handle a bunch of common cases (like running tests) in my workflow where it just couldn't figure it out efficiently. Now it just runs those tools and sets things up instead of spinning in circles for half an hour.
  Every time it tries to ask me if it can run some one-off crazy shell or python one-liner to do something, I've started asking myself if I should have it write a tool I can auto-approve instead.
- 3uba 34 days ago
  [dead]
- venturin 30 days ago
  [dead]
jerf 34 days ago
This is why I frequently refer to "next generation AIs" that aren't just LLMs. LLMs are pretty cool and I expect that even if we see no further foundational advancement in AIs that we're going to continue to see them exploited in more interesting ways and optimized better. Even if the models froze as they are today, there's a lot more value to be squeezed out of them as we figure out how to do that.
However, there are some things that I think need a foundational next-generation improvement of some sort. The way that LLMs sort of smudge away "NEVER DO X" and can even after a lot of work end up seeing that as a bit of a "PLEASE DO X" seems fundamental to how they work. It can be easy to lose track of as we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.
There should be some sort of architecture that can take a "NEVER DO X" and treat it as a human would. There should be some sort of architecture that instead of having a "context window" has memory hierarchies something like we do, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.
I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.
[-]
- cheesecakegood 34 days ago
  Actual memory, in my opinion. Right now memory is broadly speaking like a system of sticky notes the AI writes itself and checks every time, rather than an integrative system that allows learning and can trigger more flexibly.
  [-]
  - DmitriyBuchilin 20 days ago
    [dead]
- cultofmetatron 34 days ago
  Here's a fun one for you: https://www.youtube.com/watch?v=kYkIdXwW2AE&t=315s
gck1 34 days ago
As someone who went full circle prompt-enforcement > deterministic flow > prompt-enforcement, I disagree.
The reason why "DO NOT SKIP" fails is because your agent is responsible for too many things and there's things in context that are taking away the attention from this guidance.
But nobody said the agent that does enforcement must be the same agent that builds. While you can likely encode some smart decision making logic in your deterministic control flow, you either make it too rigid to work well, or you'll make it so complex that at that point, you might as well just use the agent, it will be cheaper to setup and maintain.
You essentially need 3 base agents:
- Supervisor that manages the loop and kicks right things into gear if things break down
- Orchestrator that delegates things to appropriate agents and enforces guardrails where appropriate
- Workers that execute units of work. These may take many shapes.
[-]
- ex-aws-dude 34 days ago
  Exactly, just keep adding more agents
  [-]
  - SrslyJosh 34 days ago
    I can't tell if this is satire or not. Well done!
    [-]
    - dnnddidiej 33 days ago
      It's a Heisenberg satire, because more agents going wild is indeed horrible, but agents restricting and counterbalancing each other can be useful (token cost ignored!).
- baxtr 33 days ago
  I think the key question is: How can you be sure the supervisor/orchestrator agents are reliable? You are just pushing the complexity down into another layer.
  [-]
  - dnnddidiej 33 days ago
    You can't be sure, but the point is you can be more sure, since agent 2 ("agent" is really just a fancy way of saying some code that calls Anthropic's API) has only the context needed to look for a violation of a single rule.
isityettime 34 days ago
Afaict all harnesses are wrong in this respect, some of them deeply so.
Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot to finish a turn so that I can check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.
Even things that have nothing to do with controlling the text generator's input and output are entangled with chat actions for no good reason except "it's a chat thing, let's pretend we're operating an IRC bot".
There are a zillion LLM agents out there nowadays, but none of them really separate control from the agent loop from presentation well. (A few do at least have headless modes, which is cool.)
[-]
- dnautics 34 days ago
  > Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot to finish a turn so that I can check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.
  I get what you're trying to say but in practice architecting what you propose is considerably more difficult. Why not build it and try to get hired by one of the bigcos?
  [-]
  - isityettime 34 days ago
    I don't think the basic architecture principles are novel. The big AI labs and other large tech companies already have engineers who can see this, without a doubt. But the AI labs clearly don't care if their LLM agents are just big balls of mud, and the big tech companies' priorities mostly lie elsewhere, too.
    They just want features. They don't really care about duplicated work, so half of them reinvent the TUI rendering wheel. Pluggability is something that might be actually hostile to their interests in lock-in. And the AI labs probably think "after a couple more scaling cycles, our models will be so good that our agents can just rewrite themselves from scratch"; until they hit a compute or power wall, it always looks rational to them to defer rearchitecting.
    Another real possibility is that if you work on an agent with a really clean architecture and publish it in hopes of getting hired by some AI company, all of them think "that looks great, but we don't want to rearchitect right now". Your code winds up in the training set, and a year and a half from now, existing agents can "one-shot" rewrites along the lines of your design because they're "smarter".
    As for me, I'm not that interested, personally. There are other things I want to build and I'm working on those.
  - gf000 33 days ago
    In what way would it be more complicated? This is pretty basic concurrent programming, we routinely have much much more complex concurrent designs..
    Hell, a telegram bot can handle that just fine.
    [-]
    - dnautics 33 days ago
      Yeah, basic concurrent programming is not more complicated than basic linear programming
- user34283 34 days ago
  I use the Codex desktop app.
  In the GUI I can see the context indicator and usage stats.
  It also makes it easier to jump between conversations and see the updates.
  Sometimes I use Claude Code or opencode in the terminal, and my experience is much poorer compared to the Codex desktop app.
- the_duke 34 days ago
  In codex CLI /status works just fine during a turn.
  Other things don't though.
JohnMakin 34 days ago
> Imagine a programming language where statements are suggestions and functions return “Success” while hallucinating. Reasoning becomes impossible; reliability collapses as complexity grows.
This is essentially declarative programming. Most traditional programming is imperative, what most developers are used to - I give the exact set of instructions and expect them to be obeyed as I write them. Agents are way more declarative than imperative - you give them a result, they work on getting that result. Now the problem of course, is in something declarative like say, SQL, this result is going to be pretty consistent and well-defined, but you're still trusting the underlying engine on how to go about it.
Thinking about agents declaratively has helped me a lot rather than to try to design these rube-goldberg "control" systems around them. Didn't get it right? Ok, I validated it's not correct, let's try again or approach it differently.
If you really need something imperative, then write something imperative! Or have the agent do so. This stuff reads like trying to use the wrong tool for the job.
[-]
- Terr_ 34 days ago
  > This is essentially declarative programming.
  I think it's a step more abstract than that; we're doing... How about "narrative programming"? (Though we could debate whether "programming" is still an applicable word.)
  Yes, it may look like declarative programming, but it's within an illusion: We aren't actually describing our goals "to" an AI that interprets them. Instead, there's a story-document where our human stand-in character has dialogue with a computer-character, and up in the real world we're hoping that the LLM will append more text in a way that makes a cohesive longer story with something useful that can be mined from it.
  It's not just an academic distinction, if we know there's a story, that gives us a better model for understanding (and strategizing) the relationship between inputs and outputs. For example, it helps us understand risks like prompt-injection, and it provides guidance for the kinds of training data we do (or don't) want it trained on.
  [-]
  - JohnMakin 34 days ago
    I don't hate that distinction, I just think a lot of people are approaching this from an imperative framework that might not fit.
- repelsteeltje 34 days ago
  I was thinking of declarative, but PROLOG rather than SQL. So with actual control flow and reasoning capabilities.
  And then you run into similar issues as the llm does, like silent failures, loops, contradictions unless you're very careful.
  The essence might be the same closed-world-assumption problem. In the LLM case this manifests as hallucination rather than admitting it does not know.
- miltonlost 34 days ago
  SQL's declarativeness is also based on the mathematics of relational algebra, so it will return the same result every time. Will it return it in the same amount of time every single query? No, that depends on indexing and database size. But the query itself won't be altered in the same way an LLM would be.
  [-]
  - JohnMakin 34 days ago
    Engines that use SQL can vary drastically in how they handle strings, floating points, etc., where identical SQL queries on identical data absolutely can return different results, which is why I mentioned the engine underneath - LLM's being nondeterministic in addition to declarative is kind of tangential to the point I was trying to make.
    It is the same in terraform - yes, the HCL spec defines things very precisely, but you're kind of at the mercy of how the provider and provider API decide how to handle what you wrote, which can be very messy and inconsistent even when nothing changed on your side at all. LLM/agent usage feels a lot like that to me, in the sense it's declarative and can be a bit lossy. As a result there are things I could technically do in terraform but would never, because I need imperativeness.
    My main point being, I think people are trying to ram agents into a ton of cases where they might not necessarily need or even want to be used, and stuff like this gets written. Maybe not, but I see it day to day - for instance, I have a really hard time convincing coworkers that are complaining about the reliability of MCP responses with their agents, that they could simply take an API key, have the agent write a script that uses it, and strictly bound/define the type of response format they want, rather than let the agent or server just guess - for some reason there is some inclination to "let the agent decide how to do everything."
    I think that's probably what this article is getting at, but what I am saying is: instead of creating these elaborate control flows with validation checks everywhere to rein in an unruly application making dumb decisions, why not just use it to write deterministic automation rather than using the agent as the automation?
- PaulStatezny 34 days ago
  I agree. But you can speak imperatively to agents as well ("Here are specific steps; follow them") and they can still screw up. :) I think what you're looking for is determinism, not imperativism.
  And to your point: instructing a (non-deterministic) LLM declaratively ("get me to this end state") compounds the likelihood of going off the rails.
  [-]
  - JohnMakin 34 days ago
    I don’t think I’m confusing the two, but it is an issue. See another comment I made in a sibling thread - terraform is a great example of something that is declarative and also non-deterministic. You can’t control upstream API/provider changes even between two plans happening simultaneously - that’s a lot like what working with agents feels like to me.
59nadir 34 days ago
This was one of the key insights in Stripe's explanations about Minions[0], their autonomous agent system; in-between non-deterministic LLM work they had deterministic nodes that handled quality assurance and so on in order to not leave those types of things to the LLMs.
0 - https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...
Neywiny 34 days ago
If you're trying to get reliability and determinism out of the LLM, you've already lost
[-]
- tekne 34 days ago
  Wait... why?
  Making an unreliable, nondeterministic system give reliable results for a bounded task with well-understood parameters is... like half of engineering, no?
  There's a huge difference between "generate this code here's a vague feature description" and "here's a list of criteria, assign this input to one of these buckets" -- the latter is obviously subject to prompt engineering, hallucination, etc -- but so can a human pipeline!
  [-]
  - JCTheDenthog 34 days ago
    >the latter is obviously subject to prompt engineering, hallucination, etc -- but so can a human pipeline!
    ...which is why we write deterministic code to take the human out of the pipeline. One of the early uses of computers was calculating firing tables for artillery, to replace teams of humans that were doing the calculations by hand (and usually with multiple humans performing each calculation to catch errors). If early computers had a 99% chance of hallucinating the wrong answer to an artillery firing table, the response from the governments and militaries that used them would not be to keep using computers to calculate them. It would be to go back to having humans do it with lots of manual verification steps and duplicated work to be sure of the results.
    If you're trying to make LLMs (a vague simulacrum of humans) with their inherent and unsolvable[1] hallucination problems replace deterministic systems, people are going to eventually decide to return to the tried and true deterministic systems.
    1: https://arxiv.org/abs/2401.11817
    [-]
    - tekne 24 days ago
      So how did we deal with the human mistakes? You mentioned it:
      - Get humans to check each other's work
      - Systematize the process -- breaking it down into smaller and smaller tasks where the likelihood of mistakes decreases
      - Replace as much as possible with deterministic code
      There's absolutely no reason you can't do this with LLMs -- and it might help quite a bit since LLMs are cheap. There's also hybrid systems -- where human checkers are replaced or augmented with LLM checkers.
      For example -- I have an LLM check all my scientific papers for typos and minor errors. It's caught quite a few, and when it caught something that was not actually an error, it was usually something which would benefit from clarification anyways.
      Now -- if I could afford to pay a grad student to do that, would be even better! But I can't, and if I could, not all the work which warrants a few cents of tokens warrants a few hundred dollars of tedious grad student labor -- especially when the latter has a very strong incentive to say LGTM (nothing here is life critical!)
      Likewise, we could imagine:
      - A deterministic process with a heuristic + an LLM in the loop checking, for example -- "is this likely correct?" -- perhaps escalating to a human (or a bigger LLM) in case of anomaly. I can see this being amazingly useful for automated refactors.
      - Automatic paperwork/customer service processing -- if the cost-of-failure can be bounded (say X$) and testing shows failure happens on average only reasonably often (say Y% of the time) -- it might be cheaper to run an AI system and eat that cost, especially if continuous monitoring lets you know if you have to "shut it down."
      In both cases -- there's nothing stopping an LLM from potentially having better-than-human average performance, and perhaps delegating real edge cases to actual experts. Remember: you're not competing with motivated PhDs, you're competing with minimum wage labor reading a list of instructions which is like a prompt except poorly formatted and missing steps.
  - Neywiny 34 days ago
    Because it's not possible. There is nothing you can say to the LLM that will guarantee that something happens. It's not how it works. It will maybe be taken into consideration if you're lucky.
    But if you're trying to tell me that every time you list criteria you get them all perfectly matched, you're clearly gifted.
    [-]
    - gf000 33 days ago
      I'm being deliberately pedantic, but depending on what kind of representation we use for the neural network (due to rounding) as well as the choice of inference (that is, given a distribution for next token, which one to choose), it can absolutely be reproducible and completely deterministic.
      Though chaotic, which I believe is the better word here - a single letter change may result in widely different results.
      We just choose to use more random inference rules, because they have better results.
      [-]
      - Neywiny 33 days ago
        With determinism you're not wrong. The problem is that you'd need to make sure all your seeds, temperatures, and other input parameters are exactly the same, and importantly that all context is cleared. But people don't do that. And I'm not sure every provider, or even any, lets you set those parameters.
        [-]
        gf000 33 days ago
        Even with temperature set to zero, I believe you may still get non-determinism because FP operations are not associative (the order of a parallel reduction can change the result), so what I am talking about (as mentioned, very pedantically) is mostly the theory.
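The underlying effect is that floating-point addition is commutative but not associative, so the order in which partial sums are combined can change the result. A small illustration with ordinary Python floats (IEEE 754 doubles), independent of any particular inference stack:

```python
# Grouping matters: the same three numbers, added in a different order.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Swapping the operands of a single addition does not matter.
print(0.1 + 0.2 == 0.2 + 0.1)  # True

# The same values reduced in two different orders give different totals,
# which is what can happen when a parallel sum is scheduled differently.
values = [1e16, 1.0, -1e16]
in_order = sum(values)                # 1e16 + 1.0 rounds back to 1e16, then 0.0
reordered = sum([1e16, -1e16, 1.0])   # cancellation first, then 1.0
print(in_order, reordered)  # 0.0 1.0
```

Whether this shows up in practice depends on the hardware and kernels used for inference; the snippet only demonstrates the arithmetic property.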
    - tekne 24 days ago
      "There is nothing you can say to the person that will guarantee that something happens"
- aleksiy123 34 days ago
  There’s a whole range between completely random and completely rule based deterministic.
  Somewhere in between that I guess is the varying levels of intelligence more likely able to make the “right” decision for anything you throw at it.
- evantbyrne 34 days ago
  I would hope that when engineers speak of LLM determinism they just mean it as shorthand for close to 1 under expected conditions
- pydry 34 days ago
  This is something I think some people are fundamentally not capable of understanding.
- sudosteph 34 days ago
  I mean, with reliability there's a spectrum. If the risks that an unreliable outcome brings aren't all that bad, then sometimes it's worth it to chase "my agents made an acceptable PR 70% of the time, can I get it to 90?"
  Determinism is a different matter. Scripts and hooks are really the main levers you can pull there, but yeah - a decent script and a cron job will handle certain things much better (and for a fraction of the cost).
beshrkayali 33 days ago
Humble mention, I’ve been thinking the same thing with Ossature for the last couple of months since I started working on it: https://ossature.dev
The models are already good enough for code generation. What we need is the harness around them actually deterministically enforcing a specific path and “leashing” the model's output to be aligned with the intention of the user as much as possible. You can’t make the output of the model deterministic, but you can make everything around it so.
Trying to make enforcements work with prompts is like a government agency investigating/auditing itself, there’s no incentive to find problems, so you’ll always inevitably get the “All Good, Boss!”
[-]
- toasty228 33 days ago
  > so you’ll always inevitably get the “All Good, Boss!”
  Or the opposite depending on how you ask the questions, some automated code review tools _always_ find issues, even when they don't really exist, or they exist in the scope of a function but not once the function is wired in the project.
plumbline 34 days ago
I've been thinking about this a lot actually. It can almost be related to the conversation about specialization. The more specialized a model is required to be, the less capable it seems to be at a foundational level, whereas if you just aim towards a liiitle bit of abstraction, you might get the best of both worlds.
Here's a pretty specific example of what I mean, but maybe food for thought:
Podcast (20 minute digest): https://pub-6333550e348d4a5abe6f40ae47d2925c.r2.dev/EP008.ht...
Paper: https://arxiv.org/abs/2605.00225
cookiengineer 34 days ago
We have control flow. It's requirements specifications and test driven development. You just have to enforce it, so the agents cannot cheat their way around it.
I decided to build my agentic environment differently. Local only, sandboxed, enforced with Go specific requirement definitions that different agent roles cannot break as a contract.
That alone is far better than any hyped markdown-storage-sold-as-memory project I've seen in the last weeks.
Currently I am experimenting with skills tailored to other languages, because agentskills actually are kinda useless because they're not enforced nor can any of their metadata be used to predictably verify their behaviors.
My recommendation to others is: Treat LLM output as malware. Analyse its behavior, not its code. Never let LLMs work outside your sandbox. Make it impossible for them to escape sandboxes. And that includes removing the Bash tool, for example, because that's not a reproducible sandbox.
Also, choose a language that comes with a strong unit testing methodology. I chose Go because it allows me to write unit tests for my tools, and even agents to agents communication down the line (with some limitations due to TestMain, but at least it's possible).
If you write your agent environment or harness in Typescript, you already failed before you started. Compiled code isn't typesafe because the compiler doesn't generate type checks in the resulting JS code.
Anyways, my two cents from the purpleteaming perspective that tries to make LLMs as deterministic as possible.
dkersten 33 days ago
This is something I realised late last year while using Claude Code. The LLM shouldn't be the one in control of the workflow, because the LLM can make mistakes, skip steps, hallucinate steps, etc. It's also wasteful of tokens.
I'm a firm believer that a "thin harness" is the wrong approach for this reason and that workflows should be enforced in code. Doing that allows you to make sure that the workflow is always followed and reduces tokens since the LLM no longer has to consider the workflow or read the workflow instructions. But it also allows more interesting things: you can split plans into steps and feed them through a workflow one by one (so the model no longer needs to have as strong multi-step following); you can give each workflow stage its own context or prompts; you can add workflow-stage-specific verification.
Based on my experience with Claude Code and Kilo Code, I've been building a workflow engine for this exact purpose: it lets you define sequences, branches, and loops in a configuration file that it then steps through. I've opted for passing JSON data between stages and using the `jq` language for logic and data extraction. The engine itself is written in Rust (hand coded; the recent Claude Code bugs taught me that the core has to be solid), while the actual LLM calls are done in a subprocess (currently I have my own Typescript+Vercel AI SDK based harness, but the plan is to support third party ones like claude code cli, codex cli, etc too in order to be able to use their subscriptions).
I'm not quite ready to share it just yet, but I thought it was interesting to mention since it aims to solve the exact problem that OP is talking about.
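A minimal sketch of the idea above, not the engine described in this comment: the workflow lives in ordinary code, the model sees one step at a time, and each step must pass a check before the next one runs. `run_workflow`, `fake_llm`, and the `verify` callback are hypothetical names; `fake_llm` is a stand-in for a real model call.

```python
from typing import Callable

# Hypothetical stand-in for a real model call; swap in your own client.
def fake_llm(prompt: str) -> str:
    return f"done: {prompt}"

def run_workflow(steps: list[str], llm: Callable[[str], str],
                 verify: Callable[[str], bool], max_retries: int = 2) -> list[dict]:
    """Step through the plan in code; the model only ever sees one step."""
    results = []
    for index, step in enumerate(steps):
        for attempt in range(1, max_retries + 2):
            output = llm(step)      # step-specific prompt, no workflow instructions
            if verify(output):      # stage-specific check, enforced in code
                results.append({"step": index, "attempts": attempt, "output": output})
                break
        else:
            # No attempt passed verification: stop rather than let errors propagate.
            raise RuntimeError(f"step {index} failed verification: {step!r}")
    return results

plan = ["write failing test", "implement function", "run linter"]
results = run_workflow(plan, fake_llm, verify=lambda out: out.startswith("done"))
```

Because the loop is code, a step cannot be skipped or repeated by the model; the only thing the model controls is the content of each step's output.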
[-]
- user34283 33 days ago
  I‘ve recently started to use skills and so far it’s been working great.
  Your agent can write a python script to loop and simply call „claude -p“ or „codex exec“.
  For simple workflows this seems good enough and can be set up in 10 minutes without third party software.
  What do you think?
  [-]
  - dkersten 33 days ago
    For simple workflows or once-off workflows, that's a good approach.
    For long running repeatable workflows (eg you want to leave your agent running over night, you want to run the same workflows over and over in different projects, or more autonomous Devin-like workflows) or you want audit trails/observability, vetted workflows (ie not have the LLM write them; or have the LLM write them and you review them) without having to read through scripts, or you have more complex requirements like having different models/providers for different workflow stages or the things I mentioned previously (context, plans, verification, etc), or you have more complex workflow needs (swarms or fork/join, parallel pipelines, routing/branching, error recovery or routing, etc) then a robust dedicated workflow engine is needed in my personal opinion.
    I think for most users using claude/codex for themselves on smallish projects, it's unnecessary, but as you scale up, I feel that more powerful tools are needed. Also, for corporate use, repeatable workflows with audit trails, artefact management, and job-queue-based task management start becoming more important too.
    I also feel that using a workflow engine as an internal behind-the-scenes system in a GUI-centric vibe coding tool might also help raise the ceiling compared to the existing tools, but I've yet to test that hypothesis. Just because it takes the mistakes out of the user's hands: the engine will follow proven workflows, whether you ask it to or not, keeping skills for context/knowledge, not for orchestration.
    Something else I've been experimenting with a little, but not enough yet to have an opinion, is small language models running locally for orchestration, and frontier models for doing work.
apalmer 34 days ago
Generally agree with this stance. Case in point: the breakthrough in AI coding was not so much that AI intelligence increased as that a lot of the core process execution moved out of the LLM prompt and into the harness.
rglover 34 days ago
> Babysitter: Keep a human in the loop to catch errors before they propagate.
This is the only way to guarantee AI usage doesn't burn you. Any automation beyond this is just theater, no matter how much that hurts to hear/undermines your business model.
A bird sings, a duck quacks. You don't expect the duck to start singing now, do you?
[-]
- kelseyfrog 34 days ago
  I'm not sure I agree. Like all stochastic processes, LLM errors can be quantified. That makes each use case a risk-reward tradeoff where users can decide if the tradeoff makes sense for them or not. There are scenarios where errors are acceptable because the risks are low or the rewards make up for them. This is a process engineering problem where business and technology specifics matter.
  [-]
  - rglover 34 days ago
    I see where you're coming from, but this assumes good behavior and discipline which most people/teams struggle with.
    If a business can get away with some margin of error being acceptable, more power to them. But if not (or doing so would cause additional problems; what I'd imagine to be true for a non-trivial number of orgs), it's wise to consider the nature of the tool a lot of people are suggesting is mandatory if you're dependent on consistent, predictable results.
    [-]
    - kelseyfrog 34 days ago
      That's fair. A heuristic that leaves some opportunity on the table due to org capability is a reasonable one to have.
- alasano 34 days ago
  I think babysitting LLMs is exactly the thing that burns you.
  Presuming you meant burns you out though.
  [-]
  - doubled112 33 days ago
    No, "burns you" as in "play with fire and you'll get burned".
    It will make a mistake and you will get burned, so you have to babysit it.
moconnor 34 days ago
“Flow” moves agents through a yaml flowchart of prompts and decisions. It’s working quite well for a couple of us in Tenstorrent, more to discover here though:
https://github.com/yieldthought/flow
Happily, 5.5 is good at writing and using it.
[-]
- aryehof 33 days ago
  I find Flow really interesting, thanks for pointing it out.
  Deterministic workflows using AI to help perform those steps not requiring human input has been an area of interest for me for some time. Particularly interesting how you are using the AI to determine what a step has achieved and the action of the next step.
  Combining it with workflow elements that do handle human steps, together with a notification/routing/task system, would make for a helpful system for so many.
bandrami 33 days ago
It's going to be hilarious in a few years when people are still using LLMs but only via a controlled vocabulary and syntax that you have to learn. It's just like how everybody moved to NoSQL 15 years ago but immediately recreated schemae in their JSON.
kenjackson 34 days ago
I feel like people forget that they're still allowed to program. You're still allowed to create workflows tying together LLMs and agents if you want. Almost all the tools and technology that existed before LLMs are still available to be used.
Nizoss 34 days ago
If you’re interested in such deterministic scaffolding/control flow, check out Probity.
I created it to address this exact issue. It is a vendor-neutral ESLint-style policy engine and currently supports Claude Code, Codex, and Copilot.
It uses the agents' hook payloads and session history to enforce the policies. It can be set up to block commits if a file has been modified since the checks were last run, disallow content or commands using string or regex matching, and enforce TDD without any extra reporter setup. It works with any language.
Feedback welcome: https://github.com/nizos/probity
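To make the two policy types mentioned above concrete, here is a small sketch under stated assumptions. It is not Probity's API or configuration format; `DENY_PATTERNS`, `check_command`, and `check_commit` are hypothetical names, and timestamps are plain floats supplied by the caller.

```python
import re

# Hypothetical deny list for shell commands an agent proposes to run.
DENY_PATTERNS = [
    re.compile(r"\bgit\s+push\s+--force\b"),
    re.compile(r"\brm\s+-rf\s+/"),
]

def check_command(command: str) -> tuple[bool, str]:
    """Return (allowed, message); block any command matching a deny pattern."""
    for pattern in DENY_PATTERNS:
        if pattern.search(command):
            return False, f"blocked by pattern {pattern.pattern!r}"
    return True, "ok"

def check_commit(modified_at: dict[str, float], checks_ran_at: float) -> tuple[bool, str]:
    """Block a commit if any file changed after the checks last ran."""
    stale = sorted(path for path, ts in modified_at.items() if ts > checks_ran_at)
    if stale:
        return False, "re-run checks; modified since last run: " + ", ".join(stale)
    return True, "ok"
```

Both functions return a message alongside the decision so that a blocked agent gets direction on what to do next, not just a refusal.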
venturin 30 days ago
Strong agreement on the thesis. The piece is most useful for naming what's bankrupt about prompt chains. Where it stops short is what the verification checkpoints should actually verify.
One way to slice it: there are three kinds of underspecification an agent has to close.
Intent: what the user wanted (JWT vs cookies, should free users see this feature). Verification can't close this and probably shouldn't try.
Structural: e.g. null, types, exhaustiveness, ownership. Sound static analysis closes this by construction.
Domain: e.g. auth on every route, error propagation, contract stability. A domain-shaped apparatus closes this because it knows what kind of program is being built.
Babysitter, auditor, prayer is the right taxonomy of bad options. The fourth option is making the LLM a component inside an apparatus that handles structural and domain statically, and leaving the human on intent.
sudosteph 34 days ago
This is a good discussion topic. A lot of people really seem to believe that if you word a prompt just so and throw a high-powered model at it, it will work consistently how you want. And maybe as models progress that might be the case. But right now, that's not how I've seen real life work out.
Even skills are not a catch-all, because besides the supply chain risk from using skills you pull from someone else, a lot of tasks require an assortment of skills.
I've accommodated this with my agent team (mostly sonnets fwiw) by developing what we call "operational reflexes". Basically common tasks that require multiple domains of expertise are given a lockfile defining which of the skills are most relevant (even which fragment of a skill) and how in-depth / verbose each element needs to be to accomplish the same task the same way, with minimal hallucinations or external sources.
A coordinator agent assigns the tasks and selects the relevant lockfile and sends it along or passes it along to another agent with a different specified lockfile geared towards reviewing.
It's a bit of work, but this workflow dramatically increased the quality of output for technical work I get from my agents, and I don't really need to write many prompts myself like this.
tim-projects 34 days ago
This is exactly the problem I've been working on and I see others are too. When you implement quality control gates, everything works better. It solves so many of the basic problems LLMs create: saying code is finished when it isn't, skipping tests, introducing code regressions, skipping basic code validation, etc.
I am finding that the better the quality gates are, the lower quality LLM you can use for the same result (at a cost of time).
[-]
- Nizoss 34 days ago
  Exactly! I don’t babysit TDD anymore. I have another agent that does that for me and honestly sometimes catches things I would have missed if I was the one babysitting.
  Hooks do wonders here. The payload contains a lot of information about the pending action the agent wants to make. Combine that with the most recent n events from the agent’s session history and you have a rich enough context to pass to another agent to validate the action through the SDK.
  This way the validation uses the same subscription you’re logged in to, whether you’re using Claude Code, Codex, or Copilot. The validation agent responds with a json format that you can easily parse and return, allowing you to let the action through or block it with direction and guidance. I’m genuinely impressed by how well this works considering how simple it is.
  You can find my approach here: https://github.com/nizos/probity
- DmitriyBuchilin 20 days ago
  [dead]
socketcluster 34 days ago
That's why I built https://saasufy.com/ as an agent tool for building data-driven realtime apps.
I started working on it piece by piece about 14 years ago. It was originally targeted at junior developers to provide them the necessary security and scalability guardrails whilst trying to maintain as much flexibility as possible. It's very flexible; most of Saasufy is itself built using Saasufy. Only the actual user service and orchestration is custom backend code.
Also, I designed it in a way that it would help the user fast-track their learning of important concepts like authentication, access control, schema validation.
It turns out that all of these things that junior devs need are exactly what LLMs need as well.
I tested it with Claude Code originally and got consistently great results. More recently, I tested with https://pi.dev with GPT 5.5 and it seemed to be on par.
ModernMech 34 days ago
Slowly and surely we are replacing AI with programming languages.
andai 33 days ago
Yeah, you could also see this in 2023 with Auto-GPT. People were letting GPT "drive" when what they actually needed, in most cases, was like ten lines of Python (and maybe a few calls to a llm() function).
The alternative is running your ten lines of Python in the most expensive, slowest, least reliable way possible. (Sure is popular though)
For example, most people were using the agents for internet research. It would spin for hours, get distracted or forget what it was supposed to be doing.
Meanwhile `import duckduckgo` and `import llm` and you can write ten lines that do the same thing in 20 seconds, actually run deterministically, and cost 50x less.
The current models are much better -- good enough that the Auto-GPT is real now! -- but running poorly specified control flow in the most expensive way possible is still a bad idea.
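As a rough illustration of the "ten lines of Python" point: a fixed search, summarize, combine pipeline where plain code owns the control flow and the model is only called as a function. The `search` and `llm` callables are injected, and the stubs below are hypothetical placeholders so the sketch runs offline; no real search or model library API is assumed.

```python
from typing import Callable

def research(question: str, search: Callable[[str], list[str]],
             llm: Callable[[str], str], max_pages: int = 3) -> str:
    pages = search(question)[:max_pages]   # plain code decides how much to read
    notes = [llm(f"Summarize for '{question}':\n{page}") for page in pages]
    return llm("Combine these notes into one answer:\n" + "\n".join(notes))

# Stubs so the sketch runs offline; replace with real search and model clients.
def stub_search(query: str) -> list[str]:
    return [f"page {i} about {query}" for i in range(5)]

calls: list[str] = []

def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary #{len(calls)}"

answer = research("agent control flow", stub_search, stub_llm)
```

The number of model calls is fixed at `max_pages + 1`, so cost and runtime are bounded up front instead of depending on how long an agent decides to keep going.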
kristianp 33 days ago
I have some notes for a blog along the same lines, called "Determinism vs Agents". I had the same experience with MANDATORY.
Agents are also very slow compared to code. By the time the agent has ingested the system prompt plus your prompt, sent a tool call to search for files in your repo, and then made another call to find a few patterns in those files, 30 seconds or more have passed. A non-agentic harness like Aider does that step a lot faster. Then it always checks in its changes. It doesn't have the flexibility to also run specific commands like code coverage checks, for example. Something in between Claude Code and Aider would be useful.
astrobiased 34 days ago
It's the right direction, but control flow introduces limitations within a system that is quite adaptable to dynamic situations. The more control flow you try to do, the more buggy edge cases pop up if done poorly.
Still have yet to see a universal treatment that tackles this well.
[-]
- TuringTest 34 days ago
  I would just reverse the architecture of the whole system. Build a classic deterministic program, and use LLMs as heuristics adapting the system to the environment - the functions that you call on the 'if's and 'switch' statements to decide where the system should go.
  I see this as the most robust way to build a predictable system that runs in a controlled way, taking advantage of probabilistic AIs while reducing the impact of their hallucinations.
  LLMs simply can't be trusted to follow instructions in the general case, no matter how much you constrain them. The power of very large probabilistic models is that they basically solved the _frame problem_ of classic AI: logical reasoning didn't work for general tasks because you can't encode all common sense knowledge as axioms, and inference engines lost their way trying to solve large problems.
  LLMs fix those handicaps, as they contain huge amounts of real world knowledge and they're capable of finding facts relevant to the problem at hand in an efficient way. Any autonomous system using them should exploit this benefit.
illwrks 34 days ago
I’ve been building a small ‘agent’ using copilot at work, partly a learning exercise as well as testing it in a small use case.
My personal opinion is that AI and agents are being misrepresented… The amount of setup, guidance and testing that’s required to create a smarter version of a form is insane.
At the moment my small test is:
Compressed instructions (to fit within the 8k limit)
9 different types of policies to guide the agent (json)
3 actual documents outlining domain knowledge (json)
8 Topics (hint harvesting, guide rails, and the pieces of information prepared as adaptive cards for the user)
3 Tools (to allow for connectors)
The whole thing is as robust as I can make it but it still feels like a house of cards and I expect some random hiccup will cause a failure.
[-]
- dnh44 34 days ago
  To be honest Copilot really stinks and is really far from the sharp edge of what is possible these days.
  [-]
  - illwrks 33 days ago
    100%, have to use it for work though!
morpheos137 33 days ago
It speaks to how dumbed down the human userbase has become that these kinds of articles are even presented as insightful. "Agents" are not intelligent. They are pattern extrapolators. If you want a reliable deterministic output you need a deterministic harness. Think of agents as a Monte Carlo sampling tool. The harness defines the result over noise. It is hilarious to me that the industry is going headlong into more "intelligent" agents while ignoring that intelligence is an adaptation to constraints, not some magical abstract general thing that just appears and can do useful work. AGI is a lie. Stochastic parrots + harness is a useful tool.
xuhu 34 days ago
It sounds like the "app written in C++ calling Lua scripts, versus app written in Lua calling C++ libraries" debate.
Both designs (Lightroom, game engines) have worked successfully.
There's probably nothing that prevents mixing both approaches in the same "app".
[-]
- QuercusMax 34 days ago
  This pattern has been described for decades: https://wiki.c2.com/?AlternateHardAndSoftLayers. It's not just a matter of who's in control - you can layer these things.
encoderer 34 days ago
You can get a lot done with agentic programming without going "all in" on a gastown-like system, but I think there is a minimum viable setup:
1. an adversarial agent harness that uses one agent to create a plan and implement it, and another to review the plan and code-review each step.
2. an agentic validation suite -- a more flexible take on e2e testing.
3. some custom skills that explain how to use both of those flows.
With this in place you can formulate ideas in a chat session, produce planning artifacts, then use the adversarial system to implement the plans and the validation layer to get everything working e2e for human review.
There are a lot of tools you can use for these things but I chose to just build the tooling in the repo as I go.
[-]
- Schiendelman 34 days ago
  Claude already creates multiple agents for some projects just to keep context windows smaller. I don't think it'll be long before they offer a testing agent along with their planning agent.
  [-]
  - encoderer 34 days ago
    I prefer having codex author plans and implement, and claude play reviewer. I do swap them from time to time and I have a lot of respect for claude 4.6 and 4.7 but for my domain I think codex does a better job with the authoring.
    [-]
    - Schiendelman 34 days ago
      That's a cool idea! Plus I bet you can stay in lower tiers with both?
      [-]
      - encoderer 34 days ago
        You're definitely burning more tokens with the back/forth and multi-step approach but assuming you swap who does the authoring from time to time you can definitely get the max out of each plan. Review doesn't use as many tokens.
arian_ 34 days ago
Control flow tells the agent what it's allowed to do. It doesn't tell you what the agent actually did. Both matter. Everyone is building the permission layer. Almost nobody is building the verification layer.
[-]
- allynjalford 33 days ago
  I am...
nickstinemates 33 days ago
This is why we built swamp[1].
Swamp teaches your Agent to build and execute repeatable workflows, makes all the data they produce searchable, and enables your team to collaborate.
We also build swamp and swamp club using swamp. You can see that process in the lab[2]. This combines all of the creativity of the LLM for the parts that matter, while providing deterministic outcomes for the parts you need to be deterministic.
1: https://swamp.club
2: https://swamp.club/lab
rbren 34 days ago
If you're interested in driving coding agents with code, check out the OpenHands Software Agent SDK [1]
We need to define agents in code, and drive them through semi-deterministic workflows. Kick subtasks off to agents where appropriate, but do things like gather context and deal with agent output deterministically.
This is a massive boost in accuracy, cost efficiency, AND speed. Stop using tokens to do the deterministic parts of the task!
[1] https://github.com/OpenHands/software-agent-sdk
[-]
- zapataband1 34 days ago
  "conversation.send_message("Write 3 facts about the current project into FACTS.txt.")"
  why tf would i ever need this
onion2k 34 days ago
Agents are probabilistic systems. A common mechanism to get a reliable answer from systems that can have variable output is to run them several times (ideally in separate, isolated instances) and then have something vote on the best result or use the most common result. This happens in things like rockets and aviation where you have multiple systems giving an answer and an orchestrator picking the result.
I've tried doing something similar with AI by running a prompt several times and then have an agent pick the best response. It works fairly well but it burns a lot of tokens.
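A minimal sketch of the voting idea, with one swap: the picker here is a deterministic majority count rather than another agent. `majority_vote` and the canned answers are hypothetical; `run` stands in for one isolated model call.

```python
from collections import Counter
from typing import Callable, Optional

def majority_vote(run: Callable[[], str], n: int = 5) -> Optional[str]:
    """Call `run` n times and return the answer given by a strict majority, else None."""
    answers = [run().strip().lower() for _ in range(n)]   # normalize before counting
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count * 2 > n else None

# Canned responses standing in for five independent model runs.
canned = iter(["Yes", "yes ", "no", "YES", "no"])
decision = majority_vote(lambda: next(canned), n=5)

# With no strict majority the function refuses to pick a winner.
split = iter(["a", "b", "c", "a", "b"])
undecided = majority_vote(lambda: next(split), n=5)
```

This only works when answers can be normalized to comparable values (a label, a number, a structured field); free-form text rarely produces two identical responses.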
[-]
- suprfnk 34 days ago
  But then, if an agent picks the best response, how would you know that that is reliable?
  [-]
  - xienze 34 days ago
    Obviously you have multiple agents justify why they picked a certain response and then create another agent that picks the solution with the best justification.
    [-]
    - kkyr 34 days ago
      touché
    - DmitriyBuchilin 20 days ago
      [dead]
  - onion2k 34 days ago
    You could get the agents to output something structured and then use a deterministic test if you're worried about that.
- Yokohiii 34 days ago
  An LLM's "wrong" decision is either systemic or biased. They learn "common sense" from human input (i.e. shared datasets, reinforcement learning). If a decision is flat out wrong for you, asking 10 LLMs is unlikely to help.
k__ 33 days ago
At my new job, I was assigned to improve processes with AI.
My first thought was: well, agents seem nice, but I think AI workflows are a better bet. However, I didn't really understand AI or agents in depth and felt like I was just "doing things the old way" and that removing flexibility from agents was a ridiculous idea.
After some research I got the impression that I was right. A well defined workflow and scope is just what's needed for AI. It's cheaper and more consistent. It probably even makes the whole thing run well with non-SOTA models.
briga 34 days ago
Sometimes it feels like Agents are just reinventing microservices. Except they are doing it in the most inefficient way possible. It is certainly a good way for the LLM companies to sell more tokens.
srid 32 days ago
I created https://agency.srid.ca/ to achieve some of this but from within a single agent CLI session
Here's a recent PR created end-to-end using `/do` workflow of agency: https://github.com/srid/emanote/pull/719
zby 34 days ago
I concur - it does not make sense to do in llm prompts what can be done in code. Code is cheaper, faster, deterministic and we have lots of experience with working with code.
Especially all bookkeeping logic should move into the symbolic layer: https://zby.github.io/commonplace/notes/scheduler-llm-separa...
dirtbag__dad 34 days ago
Build CLIs your agents call, that scaffold what you want, and lint so it actually achieves your intended design.
Markdown files are a good reference but they are a weak enforcement tool and go stale easily.
Avoid burying yourself in more skills docs you’re not even writing yourself and probably never even read. Focus that toward deterministic tooling. (Not that skills or prompts are bad, I agree a meta skill that tells an agent what subagents and what order to run is useful)
[-]
- zapataband1 34 days ago
  lol so write an actual deterministic program? we're close to full circle
  [-]
  - noisy_boy 33 days ago
    Yes but with the "judgement" to call them. If you put "review the results based on conditions described here and anything else suspicious you may spot before calling the <next_deterministic_program>", it should be able to catch some cases you didn't think about in your standard checks. Of course it may miss out on those or have false positives but that is the nature of the beast, as it is now.
Weryj 33 days ago
Pure agentic loops with markdown documents as the program (an 'agentic workflow') are incredible for experimenting with, developing, and testing your workflow idea.
The second it works, bake the workflow into the harness. Yesterday I was doing just that, and the whole agent loop disappeared because the process could've been condensed into a one-shot request (+1 MorphLLM fast apply) from careful context construction. (It was an Autoresearcher)
yogthos 34 days ago
This was basically my realization as well. We are trying to get LLMs to write software the way humans do it, but they have a different set of strengths and weaknesses. Structuring tooling around what LLMs actually do well seems like an obvious thing to do. I wrote about this in some detail here:
https://yogthos.net/posts/2026-02-25-ai-at-scale.html
[-]
- flowgrammer 34 days ago
  My experimentation with Verblets also concluded plain functions are the most logical harness for LLMs.
gardnr 34 days ago
This is straight outta 2023:
Agents aren't reliable; use workflows instead.
mnalley95 34 days ago
Own your control flow! A key point from 12 factor agents.
"One thing that I have seen in the wild quite a bit is taking the agent pattern and sprinkling it into a broader more deterministic DAG." - https://github.com/humanlayer/12-factor-agents/blob/main/REA...
cadamsdotcom 31 days ago
Yep. Deterministic shell around the powerful abilities of the model.
Define what a good job looks like, unskippable steps, etc etc - essentially what your process is for producing your desired output in a reproducible way.
Then codify it. Have the model write code and wrap the model in a harness that ensures said code runs when you need it to, every time.
juanre 33 days ago
Absolutely agree. However, if you do not need absolute reliability, pairs of agents are much better than single agents. These days I always have one agent coding and another code-reviewing. The code reviewer is also the holder of the lamp, keeping track of the final goal. This is applicable to whatever task you want your agents to achieve: one works, the other looks over the shoulder.
cloaky233 33 days ago
It's not that agents don't need more prompts; actually, breaking the prompt into a combination of a dynamically changing prompt and a static prompt does resolve most of the issues. Control flow, on the other hand, is harnessing + context building, which is one major part of agentic workflows. So I believe an "optimized" combination of both is what we should be looking for.
throawayonthe 33 days ago
i gave in and bought a month of claude (it really is a slot machine don't do it if you have an addictive personality lol) to vibecode a bit, and the Superpowers skill set is cool and all but it really seems like something that should be turned into a program
hmmmmmm maybe i could vibecode a harness based on that pi thing i've heard about, and integrate it closer with jj instead of relying on llms knowing how to use it, and make certain stages guaranteed to run... oh dear
edit: also i can't bring myself to believe the 'ultimate' form or whatever stabilizes out will be chat-based interfaces for coding and code generation
i think it's just that openai happened to strike gold with ChatGPT and nobody has time to figure anything else out because they've got to get the bazillion investor dollars with something that happens to kinda work
also afaiu all these instruct models are based on 'base' models that 'just' do text prediction, without replying with a chat format; will we see code generation models that output just code without the chat stuff?
niyikiza 34 days ago
My analogy[1] has been that we need a valet key: capped speed, geofenced, short ttl, can't open trunk/glovebox, etc. That way we don't have to say pretty please to the valet and hope that they won't get ideas.
[1] https://niyikiza.com/posts/capability-delegation/
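One way to picture the valet key is a small capability object that the harness checks before every tool call. This is a sketch under stated assumptions, not the design from the linked post; `ValetKey`, its fields, and the tool names are hypothetical, and time is passed in explicitly so the behaviour is easy to test.

```python
from dataclasses import dataclass

@dataclass
class ValetKey:
    allowed_tools: frozenset   # tools this key may invoke (the "geofence")
    expires_at: float          # absolute expiry time in seconds (short TTL)
    max_calls: int             # hard cap on total tool calls
    calls_made: int = 0

    def authorize(self, tool: str, now: float) -> bool:
        """Enforced by the harness, so the agent cannot talk its way past it."""
        if now >= self.expires_at or tool not in self.allowed_tools:
            return False
        if self.calls_made >= self.max_calls:
            return False
        self.calls_made += 1
        return True

key = ValetKey(allowed_tools=frozenset({"read_file"}), expires_at=100.0, max_calls=2)
first = key.authorize("read_file", now=10.0)    # within scope, TTL, and cap
shell = key.authorize("shell", now=10.0)        # tool was never delegated
second = key.authorize("read_file", now=11.0)   # second and last permitted call
third = key.authorize("read_file", now=12.0)    # call cap exhausted
late = ValetKey(frozenset({"read_file"}), 100.0, 2).authorize("read_file", now=100.0)
```

Denied calls do not consume the call budget, so a refused request cannot be used to exhaust the key.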
astra_omnia 34 days ago
I think this also points to what needs to exist after the control-flow layer. Once an agent executes a bounded workflow, teams still need a reviewable object showing what authority/scope it had, what artifacts it touched, what validation ran, what evidence was retained, and what limitations remain. Logs are useful, but they are not the same thing as an action receipt.
kmad 34 days ago
This is, at least in part, the promise of frameworks like DSPy and PydanticAI. They allow you to structure LLM calls within the broader control flow of the program, with typed inputs and outputs. That doesn’t fix non-determinism, hallucinations, etc., but it does allow you to decompose what it is you’re trying to accomplish and be very precise about when an LLM is called and why.
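A library-free sketch of what "typed inputs and outputs" buys you. It deliberately does not use the DSPy or PydanticAI APIs; `Verdict`, `parse_verdict`, and `review` are hypothetical names, and `llm` is any callable that takes a prompt and returns text.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Verdict:
    passed: bool
    reason: str

def parse_verdict(raw: str) -> Verdict:
    """Turn raw model text into a typed value, or fail loudly."""
    data = json.loads(raw)   # raises ValueError (JSONDecodeError) on non-JSON output
    if (not isinstance(data, dict)
            or not isinstance(data.get("passed"), bool)
            or not isinstance(data.get("reason"), str)):
        raise ValueError(f"unexpected shape: {data!r}")
    return Verdict(passed=data["passed"], reason=data["reason"])

def review(requirement: str, llm: Callable[[str], str]) -> Verdict:
    # The model is called at exactly one point, with one job.
    raw = llm(f'Reply as JSON {{"passed": bool, "reason": str}}. Requirement: {requirement}')
    return parse_verdict(raw)

verdict = review("login page loads", lambda prompt: '{"passed": true, "reason": "ok"}')
```

The surrounding program only ever handles a `Verdict` or an exception, so a malformed or off-topic response cannot silently flow into later steps.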
pron 34 days ago
How do you have "aggressive error detection" when one of the most common and pernicious kinds of mistake agents make is architectural? The behaviour is fine, but the code is overly defensive, hiding possible bugs and invariant violations, leading to ever more layers of complexity that ultimately end up diverging when nothing can be changed without breaking something.
alasano 34 days ago
I'm building a robust runtime for this.
It's externally orchestrated and managed, not driven by an agent running the loop.
The goal is to force LLMs to produce exactly what you want every time.
I will be open sourcing soon. You can use whatever harness or tools you already use, you just delegate the actual implementation to the engine.
https://engine.build
danieljhkim 33 days ago
Sharing something that I am building right now for this:
- https://github.com/danieljhkim/orbit
- https://orbit-cli.com/
Any feedback is welcome.
EGreg 33 days ago
Seems more and more people are coming to the same realization:
https://community.safebots.ai/t/prominent-people-come-to-the...
chandureddyvari 34 days ago
I had good success with hooks in Claude Code. Personally I feel this problem was common with humans as well. We added tools like husky for git commits, so that our peers would push code which was linted, type checked etc.
I feel hooks are an integral part of your code harness; they are the only deterministic way to control coding agents.
[-]
- Nizoss 34 days ago
  I fully agree. Also started using husky before expanding further and created my own hooks. I can’t imagine myself using agents today without them, it would require a lot of babysitting.
piyh 33 days ago
9 different frameworks being pushed in the comments of this thread. 2026 truly is the year of agents.
[-]
- yangbiaogaoshou 33 days ago
  which 9 frameworks?
sbinnee 33 days ago
I have been telling my team that 1000 lines of instructions are doomed to fail no matter how good a model's instruction-following capability is. I have been reviewing hundreds of lines of changes daily for about a month. I couldn’t help but start praying.
ltbarcly3 34 days ago
Don't listen to anyone who knows what should be done without proof. If someone 'knows' what agents 'need' then that knowledge is worth millions of dollars right now. If they haven't built it they are probably just talking shit.
mf_kevintruong 33 days ago
Correct, that is why we need something like a DAG or kanban flow to control agents: deterministic steps combined with non-deterministic ones.
The control flow needs to be bound to the non-deterministic agent to keep things strict, but it also needs to be flexible enough.
solomonb 34 days ago
I agree and I think a really wonderful way to encode agentic control flow would be with Polynomial Functors.
https://arxiv.org/abs/2312.00990
hmaxdml 34 days ago
We've found that durable workflows are a much-needed primitive for agent control flow. They give the structure for deterministic replays, observability, and, of course, fault tolerance that operators need to make the agent loop reliable.
zingar 33 days ago
This is a refreshing take but I’d really have liked an example for contrast.
est 33 days ago
I have a question: does an LLM learn to follow these MANDATORY or DO NOT SKIP markers during pre-training, e.g. from how people write comment paragraphs in the Reddit corpus, or is it just a post-training alignment habit?
[-]
- stingraycharles 33 days ago
  Instruction following is a specific fine tuning / post training phase, yes.
  That’s why you see “base” vs “instruct” models for example — base is just that, the basic language model that models language, but doesn’t follow instructions yet.
  Especially the open weights models have lots of variants, eg tuned for math, tuned for code, tuned for deep thinking, etc.
  But it’s definitely a post train thing, usually done by generating synthetic data using other models.
vitlyoshin 33 days ago
This feels right. Once an agent touches a real business workflow, prompts become only one layer. Reliability comes from state, validation, observability, and explicit failure handling.
danborn26 33 days ago
Relying on prompt engineering for logic is incredibly fragile. Explicit state machines and programmatic routing provide the predictability that complex agents actually require.
arbirk 34 days ago
I always wonder with these posts:
- are they talking about coding (where I am the control flow)
- or RPA agents (in which it is obvious)?
- also, don't use an LLM for deterministic tasks
colek42 34 days ago
We built https://aflock.ai/ (open source) to help with this. Constraining activity tends to work well
noashavit 33 days ago
100% Agents need reliable state management, conditional logic, and structured execution paths. Prompt engineering is a surface layer solution to a deeper problem
idivett 34 days ago
Isn't that what they call "Harness engineering"?
allynjalford 33 days ago
Totally agree. That's why i built it. https://backpac.xyz/cairn-cli
glasner 34 days ago
This is exactly why I’m building aiki to be a control layer for harness execution. I don’t think the model companies will ever give us the neutral layer we need.
jarboot 34 days ago
I think this is a good usecase for temporal + pydantic-ai
mohamedkoubaa 34 days ago
Eventually we'll all come to the inevitable conclusion that for a task to be fully automated there should be neither human nor genie in the loop.
SrslyJosh 34 days ago
> "Agents need control flow, not more prompts"
Can't wait for y'all to come full circle and invent programming from first principles.
graphememes 34 days ago
> If you’ve ever resorted to MANDATORY or DO NOT SKIP, you’ve hit the ceiling of prompting.
using this is going to do the opposite of what you want
trolleski 33 days ago
Maybe we could devise a language which would be like a natural language but have some pretty neat formal properties... Wait...
aykutseker 34 days ago
all caps in a prompt is a code smell. when you're typing MANDATORY, you should be writing a wrapper, not refining the prose.
[-]
- Nizoss 34 days ago
  Exactly! I have said this a couple of times but it was taken literally as in no capital letters or strong language. Glad to see someone else who shares this perspective.
shivnathtathe 33 days ago
Observability is the missing piece here — built opensmith for exactly this reason, tracing agent control flow locally
cesarvarela 34 days ago
This will remain a persistent problem without a definitive answer until models move from generative tools to actual AI.
dnautics 34 days ago
Yes. Humans are also unreliable and nondeterministic (though certainly more reliable). Accordingly we have built software dev practices around this. I imagine it would be super useful for example to have a "TDD enforcer":
Phase 1: only test files may be altered, exactly one new test failure must appear.
Phase 2: only code files may be altered. The phase is cleared when the test now succeeds and no other tests fail.
If you get stuck, bail and ask for guidance
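A minimal sketch of such a gate in Python, assuming the harness already knows which files changed and which tests fail before and after each phase. The function names, the `tests/` and `test_` naming convention, and the `max_attempts` limit are all assumptions for illustration.

```python
def is_test_file(path: str) -> bool:
    # Assumed convention: test files live under tests/ or are named test_*.
    name = path.rsplit("/", 1)[-1]
    return path.startswith("tests/") or name.startswith("test_")


def phase1_ok(changed: set, failing_before: set, failing_after: set) -> bool:
    """Phase 1: only test files changed and exactly one new test failure."""
    new_failures = failing_after - failing_before
    return all(is_test_file(p) for p in changed) and len(new_failures) == 1


def phase2_ok(changed: set, failing_after: set) -> bool:
    """Phase 2: only code files changed and no test fails any more."""
    return not any(is_test_file(p) for p in changed) and not failing_after


def gate(ok: bool, attempts: int, max_attempts: int = 3) -> str:
    """Decide the next step; bail to a human after too many failed attempts."""
    if ok:
        return "advance"
    return "retry" if attempts < max_attempts else "ask_human"
```

The agent still writes the tests and the code; the harness only decides whether a phase is cleared, so a skipped or reordered phase cannot slip through.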
[-]
- ManWith2Plans 34 days ago
  I've been busy building and dogfooding open-artisan for my own development purposes. I've diverged quite a bit from main and am hoping to merge some of those changes back soon. It's basically an OpenCode plugin that forces OpenCode through a token-hungry state machine that tries to map the engineering process I follow, exposing only valid tools and states at every step of development. If you're interested in following along or trying it out, it's available here:
  https://github.com/yehudacohen/open-artisan/
  Hopefully, I'll merge in my large structural changes in the next couple of weeks. These structural changes will enhance the state machine meaningfully, as well as adding support for hermes agent.
Imanari 33 days ago
As with so many things aider.chat was ahead of its time with its ability to create deterministic scripts.
pjmlp 33 days ago
Which is exactly what tools like n8n, langflow, opal, workato and many others offer.
Did the author miss out on them?
2001zhaozhao 34 days ago
If we need control flows, then designing these control flows ought to be the future of agent engineering
mhotchen 34 days ago
HUMANS need control flow. It's a very effective strategy that has worked wonders in healthcare
[-]
- stonewizard 34 days ago
  [dead]
rickysahu 33 days ago
we work on this issue in healthcare (genhealth.ai) where it's imperative to get every step correct. not easy. a valuable solution at the intersection of browser, code, lmms. there r far more layers of browser interaction than just imgs and dom.
geon 34 days ago
How is this not obvious to everyone? It's like people forgot how to engineer.
empath75 34 days ago
I have heard this sort of thing a lot from people working with agents, and I just think it's fundamentally misguided as a way to think of them, and if you work with them this way, you are probably setting money on fire for no reason because the tasks you are able to perform this way _don't need agents to begin with_.
You might use an LLM api call here as a translation or summary step in a deterministic workflow, but they are not acting as agents, because they lack _agency_.
The value of using an agent harness is precisely that they are _not deterministic_. You provide agents a goal, tools and constraints and they do the task they were asked to perform as best as they can figure out how to do it. You may provide them deterministic workflows as tools they can call, but those workflows, outside of the agent harness itself, should not constrain what the agent does. You are paying a lot of money for agent reasoning, not to act as an expensive data transformation pipeline.
It may be the case that a lot of agentic workflows are more properly done with fully deterministic workflows, but the goal there should be to _remove the agents entirely_ and spend those tokens on non deterministic tasks that require agentic decision making.
I do think there are fundamental limits to what agents are capable of doing unsupervised and there does need to be a lot more human guidance, observability and control over what they are doing, but that's sort of the opposite of embedding them in deterministic workflows; that is more of a team integration/communication problem to solve.
_pdp_ 34 days ago
Or maybe, just maybe, LLMs do not run deterministically and that is ok?
In the real world almost nothing runs like that - only software and even that has a lot of failures.
So perhaps rather than trying to make agents run deterministically the goal is to assume some failure rate and build compensating controls around it.
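One possible shape for such a compensating control, sketched in Python: run the unreliable step, validate the result, retry a bounded number of times, then escalate. `run_with_compensation`, `is_valid`, and `max_attempts` are hypothetical names, and the flaky agent is simulated with a scripted iterator.

```python
from typing import Callable, Optional


def run_with_compensation(
    step: Callable[[], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Run an unreliable step, validate each result, return None to escalate."""
    for _ in range(max_attempts):
        result = step()
        if is_valid(result):
            return result
    return None  # caller escalates, e.g. to a human reviewer


# Scripted stand-in for a flaky agent: two bad replies, then a good one.
replies = iter(["", "???", "PASS: 12 checks"])
outcome = run_with_compensation(
    lambda: next(replies),
    lambda r: r.startswith("PASS"),
)
```

This accepts the failure rate instead of fighting it: the step may fail, but the check and the attempt budget are deterministic.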
eth415 34 days ago
agreed - this is what we’ve been trying to build at scale.
https://github.com/salesforce/agentscript
[-]
- crsn 34 days ago
  Ditto Ethan's point -- and hundreds of customers tell us it works very well. We'd value more feedback from this community, not just the Salesforce/Agentforce customer base!
oinoom 34 days ago
this is just advocating for a harness, which has been the focus (along with evals) for at least the last three months by pretty much anyone working with agents professionally or seriously
moron4hire 34 days ago
I've been building this at work. It's... shockingly not hard. People have been telling me, "get into agentic coding now or you'll get left behind" and the things they are saying need training and taste and expertise to figure out how to cajole the AI into doing a job are things that I can just write a program to do.
There's this guy at work who is kind of precious about Claude Code. When Hegseth banned Anthropic, this guy freaked out. He spent many pages ranting about how terrible Gemini and Codex are and basically nuked his project. He insisted only Claude could do his project.
Meanwhile, I managed to redo his work with GPT 4o in a weekend. No AI generated code anywhere, just being capable of writing a for-loop over a directory of files my own self. The AI part is only really necessary because folks can't be bothered to author documents with proper hierarchies.
People talk about "AI is going to eliminate boilerplate and accelerate development and we'll do new jobs that were too costly before". Yet this guy spent weeks coaxing Claude to do something that took me a few hours because "boilerplate" is really not that big of a deal. If this is the kind of job we're going to be able to do because the value-to-effort ratio was less than 1, it kind of indicates to me that there isn't a lot of value to gain at any level of effort. Yeah, it's not really worth your time to bend over and pick up a penny, but even if I had a magical penny snagging magnet, I'm still going to ignore the pennies because that's just how valueless pennies are.
If AI lets me never have to open a PowerPoint from a client to read the chart values from the piechart they screenshot and pasted into PowerPoint, that's wonderful. What more would I ever need? The rest of the work just isn't that hard. But if you think AI is going to replace people like me because it can do "boilerplate", the AI is not anywhere near as fast or cheap at getting to a reliable, consistent, repeatable process as a human for that.
ubj 34 days ago
I've said this before, but it's interesting to see momentum go back and forth between the flexibility and ease of everyday language, and the formal rigor of programming languages.
It feels like we are still discovering the optimal operating range on a spectrum between these two domains. Perhaps the optimal range will depend on the specific field in question.
[-]
afxuh 34 days ago
that's why agents complete a project with the first 3 prompts, then maintaining and fine-tuning it takes ages till it hits "-Session token expired"
pedroneto3 33 days ago
agreed, but I don't think it's gonna happen without the AI's own help. People only think about earning, and because of this we need a non-human vision
zekenie 34 days ago
you know it really depends on what you're trying to accomplish and if it's possible to describe it with deterministic control flow
try-working 34 days ago
that's why you need a recursive workflow that creates its own artifacts per step that can later be used for verification.
[-]
- Nizoss 34 days ago
  Sounds interesting, can you elaborate on your thinking? Got me curious.
  [-]
  - try-working 33 days ago
    how do you verify the work that was just done in the current stage? verify against the output artifacts from the previous stages. for example, if you have a requirement doc, then you can analyse the codebase for current state, and store as a doc. then generate the implementation plan based on the delta between requirements and current state. after implementation, create an implementation summary doc. to verify the implementation in the next stage, compare the implementation summary against the implementation plan, the previous codebase analysis and the original requirements doc, as well as codebase diffs.
    so, every stage outputs a source of truth for that stage, which can be used by later stages for verification, alone or together with other artifacts. if you want to read more, here's the recursive-mode development workflow I built: https://recursive-mode.dev/introduction
  - nhectropic 33 days ago
    [dead]
throwthrowuknow 34 days ago
Isn’t this basically what Palantir does?
terminalbraid 34 days ago
My friend, you have invented management.
[-]
- Nizoss 34 days ago
  Not throwing shade at anyone here but the thought has definitely crossed my mind that we are recreating SAFe but for agents when looking at some of the orchestration setups out there. I think that it is better to not force the same hierarchical processes that worked for humans in large organizations onto agents and instead look at what they need to give better results and what their failure modes look like.
sidcool 33 days ago
How does one achieve this?
[-]
- philipp-gayret 33 days ago
  Native integrations with agents, i.e. Claude Code's system of Hooks.
  Harnesses, which kick off agents with what to do.
  Tools, which show an agent where in a process it is, and what the next step should be.
  In my experience I find Hooks to be extremely powerful cross-project. CLI Tools are easy to make also, and work really well for guiding agents.
  [-]
  - sidcool 31 days ago
    Thanks. Any tutorials?
    [-]
    - philipp-gayret 31 days ago
      For Hooks (IMO the most powerful feature) I'd recommend only https://code.claude.com/docs/en/hooks-guide and for Plugins, Skills, MCP and so on the official documentation by Anthropic has been the only source you'd need. As for harnesses and CLI tools I'd go with whatever you're already familiar with, can't make a particular recommendation.
marvinified 34 days ago
Depends on the use case
hombre_fatal 33 days ago
That this gets so much traction is an insight into the lack of process the average HNer is using while they say they can't get LLMs to do anything useful for them.
Turns out it really was just them expecting one-shots in Claude Code with "make no mistakes pls".
Something to keep in mind when listening to LLM discourse on HN.
droolingretard 34 days ago
Are you the guy who used to write MapleStory hacks?
ncrmro 34 days ago
deepwork.md is made for this.
[-]
- nhectropic 33 days ago
  [dead]
carterschonwald 34 days ago
i mean of course. ive been working on this the past few months and ive a bunch of tech towards this in flight, including some harness forks to layer my ideas in. eg my oh punkin pi test bed on my github.com/cartazio page , theres some shockingly obvious once you see it tricks that i think i can stack into a really nice harness product for just doing hard real work with these models more easily
ares623 33 days ago
Guys, c'mon, what are we even doing...
AIorNot 34 days ago
I mean we have Langgraph, BAML etc
MagicMoonlight 33 days ago
This is slop generated right?
taherchhabra 34 days ago
I wrote something recently on how agent development differs from traditional software development
https://x.com/i/status/2051706304859881495