Ask HN: How do you debug multi-step AI workflows when the output is wrong?

I’ve been building multi-step AI workflows with multiple agents (planning, reasoning, tool use, etc.), and I sometimes run into cases where the final output is incorrect even though nothing technically fails. There are no runtime errors - just wrong results.

The main challenge is figuring out where things went wrong. The issue could be in an early reasoning step, how context is passed between steps, or a subtle mistake that propagates through the system. By the time I see the final output, it’s not obvious which step caused the problem.

I’ve been using Langfuse for tracing, which helps capture inputs and outputs, but in practice I still end up manually inspecting steps one by one to diagnose issues, which gets tedious quickly.
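
For reference, a rough sketch of how my steps are instrumented (step names and bodies are placeholders, using Langfuse's v2-style decorator import):

    # Each decorated function becomes a span with its inputs/outputs captured.
    from langfuse.decorators import observe

    @observe()
    def plan(task: str) -> list[str]:
        return [f"research {task}", f"draft answer for {task}"]

    @observe()
    def execute(steps: list[str]) -> str:
        return " -> ".join(steps)

    @observe()
    def workflow(task: str) -> str:
        # Nested calls show up as a span tree in the Langfuse UI.
        return execute(plan(task))

Every span looks fine in isolation; the hard part is localizing which one actually went wrong.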

I’m curious how others are approaching this. Are there better ways to structure or instrument these workflows to make failures easier to localize? Any patterns, tools, or techniques that have worked well for you?

3 points | by terryjiang2020 1 day ago

5 comments

  • hifathom 4 hours ago
    The pattern that's worked for me: instrument the attention, not just the execution.

    In my memory system for agents, every recall query logs which memories surfaced, their relevance scores (broken into keyword vs. vector components), and whether the agent actually used them. This gives you a trace of what the agent was "thinking about" at each step — not just what it did.

    When the final output is wrong, I look at the recall logs first. Usually the problem is one of: (1) the right context didn't surface because the scoring missed it, (2) the right context surfaced but was outranked by something irrelevant, or (3) the context was fine but the agent misinterpreted it.

    Each of these has a different fix — better keywords/embeddings for (1), scoring weight tuning for (2), and prompt engineering for (3). Without the instrumented attention trace, they all look the same: "it got the wrong answer."
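
    Roughly, the record and the triage logic look like this (just a sketch; the field names are illustrative, not from any particular library):

        from dataclasses import dataclass

        @dataclass
        class RecallEvent:
            # One logged recall: what surfaced, why, and whether it was used.
            step: str              # workflow step that issued the query
            query: str             # the recall query text
            memory_id: str         # which memory surfaced
            keyword_score: float   # lexical component of the relevance score
            vector_score: float    # embedding component of the relevance score
            rank: int              # position among surfaced memories (0 = top)
            used_in_output: bool   # did the agent actually draw on it?

        def diagnose(events: list[RecallEvent], relevant_ids: set[str]) -> str:
            # Map a wrong final answer to one of the three failure modes above.
            surfaced = {e.memory_id for e in events}
            if not relevant_ids & surfaced:
                return "(1) right context never surfaced: fix keywords/embeddings"
            if not any(e.memory_id in relevant_ids and e.rank == 0 for e in events):
                return "(2) right context outranked by noise: tune scoring weights"
            return "(3) context surfaced and ranked well: the agent misread it"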

  • tucaz 1 day ago
    Keep doing what you are doing, but dump the trace contents into an LLM agent (cowork, code, opencode, etc.) and ask it to take a first pass. It’ll at least narrow things down for you. Use a smart model and it should be helpful.
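
    Something like this works for the first pass (a sketch: it assumes the trace is already exported as a dict, e.g. from Langfuse, and the model name and prompt are just examples):

        # Hand the whole trace to a model and ask where the workflow broke.
        import json
        from openai import OpenAI

        def triage(trace: dict) -> str:
            client = OpenAI()
            resp = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": (
                        "You are debugging a multi-step agent workflow. Given a "
                        "trace of step inputs and outputs, identify the first "
                        "step whose output looks wrong and explain why.")},
                    {"role": "user", "content": json.dumps(trace, indent=2)},
                ],
            )
            return resp.choices[0].message.content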
    • terryjiang2020 13 hours ago
      Hmm, which model would count as a smart one for this case? Or do I just try the latest OpenAI/Gemini/Claude model?
      • tucaz 4 hours ago
        I love Claude Code but that can be expensive. If you’re on a budget you can run K2.5 with OpenCode.
  • syumpx 10 hours ago
    multi-step may be what’s killing it. simplify and let the llm do the work
  • BlueHotDog2 1 day ago
    just releasing something in this direction: a git-like for agents