The main challenge is figuring out where things went wrong. The issue could be in an early reasoning step, how context is passed between steps, or a subtle mistake that propagates through the system. By the time I see the final output, it’s not obvious which step caused the problem.
I’ve been using Langfuse for tracing, which captures inputs and outputs at each step, but in practice I still end up manually inspecting the steps one by one to diagnose issues, and that gets tiring fast.
I’m curious how others are approaching this. Are there better ways to structure or instrument these workflows to make failures easier to localize? Any patterns, tools, or techniques that have worked well for you?
In my memory system for agents, every recall query logs which memories surfaced, their relevance scores (broken into keyword vs. vector components), and whether the agent actually used them. This gives you a trace of what the agent was "thinking about" at each step — not just what it did.
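To make that concrete, here is a rough sketch of the shape of a recall-log entry; the names and fields are illustrative, not the exact schema:

```python
# Rough sketch of a per-recall log entry; field names are illustrative,
# not an exact schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RecalledMemory:
    memory_id: str
    text: str
    keyword_score: float    # lexical (e.g., BM25-style) component
    vector_score: float     # embedding-similarity component
    was_used: bool = False  # did the agent actually use this memory?

    @property
    def combined_score(self) -> float:
        # naive additive combination; real scoring usually weights the parts
        return self.keyword_score + self.vector_score

@dataclass
class RecallLog:
    step: int                 # which workflow step issued the recall
    query: str                # the recall query the agent issued
    surfaced: List[RecalledMemory] = field(default_factory=list)
```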
When the final output is wrong, I look at the recall logs first. Usually the problem is one of three things: (1) the right context didn't surface because the scoring missed it, (2) the right context surfaced but was outranked by something irrelevant, or (3) the context was fine but the agent misinterpreted it.
Each of these has a different fix — better keywords/embeddings for (1), scoring weight tuning for (2), and prompt engineering for (3). Without the instrumented attention trace, they all look the same: "it got the wrong answer."
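The triage itself boils down to something like the sketch below. This is a simplified illustration rather than actual code: the function name, the dict shape, and the top_k cutoff are all assumptions, and "expected_id" stands in for however you identify the memory that should have grounded the answer.

```python
# Hedged sketch of the triage step: given the surfaced memories from one
# recall (id -> combined score), the id of the memory that should have
# grounded the answer, and a cutoff for what the agent actually sees,
# classify which of the three failure modes applies.
from typing import Dict

def classify_failure(surfaced: Dict[str, float], expected_id: str, top_k: int = 5) -> str:
    if expected_id not in surfaced:
        return "(1) right context never surfaced -> improve keywords/embeddings"
    # rank surfaced memory ids by combined score, highest first
    ranked = sorted(surfaced, key=surfaced.get, reverse=True)
    if ranked.index(expected_id) >= top_k:
        return "(2) surfaced but outranked by irrelevant items -> retune score weights"
    return "(3) context reached the agent but was misread -> prompt engineering"
```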