LLMs Corrupt Your Documents When You Delegate

(arxiv.org)

217 points | by rbanffy 9 hours ago

23 comments

  • timacles 3 hours ago
    Least shocking thing I've read about LLMs recently.

    They are essentially like that one JPEG meme, where each pass of saving as JPEG slightly degrades the quality until by the end it's unrecognizable.

    Except with LLMs, the starting point is intent. Each pass of the LLM degrades the intent: in the case of a precise scientific paper, a little bit of nuance, a little bit of precision is lost with a re-wording here and there.

    LLMs are mean-reversion machines: the more 'outside of their training' the context/workload they are currently dealing with is, the more they will tend to gradually pull it into some homogeneous abstract equilibrium.

    • isityettime 53 minutes ago
      I've definitely experienced this while coding with LLMs. Often, after a flurry of feature work in which I thought I was being reasonably careful but moving very fast, I take a closer look at some small piece of code and go "holy shit". Then I have to spend a few hours going over everything and carefully reworking parts where things didn't quite go how I'd like, where I was unclear, or where the LLM's brainworms kicked in.

      Quality is really important to me in its own right, but I also worry about this exact "repeated compression" problem: when my codebase is clean and I have an up-to-date mental model, an LLM can quickly help me churn out some feature work and still leave the codebase in a reasonable state. But as the LLM dirties up the codebase, its past mistakes or misunderstandings compound, and it's likely to flub more and more things. So I have to go back and "restore" things to a correct state before I feel comfortable using the LLM again.

      • fiddlerwoaroof 36 minutes ago
        My experience mostly matches this. I think of a piece of development work as having three phases:

        1. Prototype
        2. Initial production implementation
        3. Hardening

        My experience with LLMs is that they solve “writer’s block” problems in the prototyping phase at the expense of making phases 2+3 slower because the system is less in your head. They also have a mixed effect on ongoing maintenance: small tasks are easier but you lose some of the feel of the system.

      • majormajor 48 minutes ago
        Yeah, a lot of "it doesn't matter how the code looks" convos seem to be ignoring that we know what happens over time when you just make tactical the-tests-still-pass changes over and over and over again. Slowly some of those tests get corrupted without anyone noticing. And you never had the ENTIRE spec (and all the edge-case but user-relied-on things) covered anyway. And then new dev gets way harder.
      • originalvichy 22 minutes ago
        This is definitely most annoying when dealing with software or standards with slightly illogical or hard to grasp cases. Recently, I worked on one of the software community's favourite spaces, timezones, and kept getting myself and my LLM context polluted with the confusion that arises when using POSIX standard timezone notation and common human-readable formats.

        This blog post probably covers my exact headache [0]. In summary, "Etc/GMT+6" actually means UTC-6. I was developing a one-off helper script to mass-create calendars in a web app via API, and when trying to validate my CSV+Python script's results, I kept getting confused about when the CSV rows had correct data and when the web app UI did. The LLM probably wrote the Python script in a manner that translated this on the fly, but my human-readable "Calendar name" column, which had "Etc/GMT+6", would generate a -6 in the web app. This probably would not have been a problem with explicit locations specified, but my use case would not allow for that.
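
        You can see the inversion directly with Python's standard-library zoneinfo (the date here is arbitrary, nothing is specific to my script):

          from datetime import datetime
          from zoneinfo import ZoneInfo

          # POSIX-style "Etc/GMT+6" is six hours *behind* UTC, i.e. UTC-6.
          dt = datetime(2024, 1, 1, 12, 0, tzinfo=ZoneInfo("Etc/GMT+6"))
          print(dt.utcoffset())                  # -1 day, 18:00:00 (= -6:00)
          print(dt.astimezone(ZoneInfo("UTC")))  # 2024-01-01 18:00:00+00:00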

        When trying to debug whether something was wrong, the thinking trace would go into loops trying to figure out whether the "problem" came from my directions, the code's bugs, or the CSV having incorrect data.

        Learning: when facing problems like this, try using the well-known "notepad file" methods to track them, so that if the over-eager LLM starts applying quick code fixes – even though YOU were the source of the "problem" – it will be easier to undo or clean up code that was added to the repository during a confusing debug session. For me, it has been difficult to separate "code generated as genuinely more resilient improvements" from "code generated during debugging that sort of changed some specific step of the script".

        (Do note that I am not an advanced software engineer, my practices are probably obvious to others. My repos are mainly comprised of sysadmin style shell/python helper code! :-) )

        [0] https://blacksheepcode.com/posts/til_etc_timezone_is_backwar...

    • ekidd 3 hours ago
      Where this result is actually interesting and relevant is when a coding agent splits a large source file into multiple smaller files. Opus + Claude Code will try to recite long sections of source code from memory into each of the new files, instead of using some sort of copy/paste operation like a human would.

      Moving a file is a bit easier. LLMs may sometimes try to recite the file from memory. But if you tell them to use "git mv" and fix the compiler errors, they mostly will.

      Ordinary editing, on the other hand, generally works fine with any reasonable model and tool setup. Even Qwen3.6 27B is fine at this. And for in-place edits, you can review "git diff" for surprises.

      • ClikeX 1 hour ago
        > And for in-place edits, you can review "git diff" for surprises.

        I don't let AI touch git anyway, and I always review the diff after it has generated stuff. If it modifies my documentation, I always want to check whether it messed with the text instead of just adding formatting.

        • isityettime 49 minutes ago
          This. I know the LLM agents often have their own little diff viewers and edit approval workflows, but for a high volume of code, I cannot imagine actually reviewing everything without leaning on much more capable Git tooling.

          I use Magit, and up until I started using LLM agents it was mostly a nice-to-have that I relied on casually. (I was definitely under-utilizing its power.) But for reviewing, selectively staging, and selectively rejecting the changes of an LLM agent? I feel like I'd die without it. Idk how others manage.

      • devmor 2 hours ago
        If you’re using LLMs for agentic work it is absolutely essential that you have a robust set of tools for them to use and the correct instructions to prompt their use.

        The LLM will come up with stupid ways to do things, common sense doesn’t exist for AI.

        • jvuygbbkuurx 2 hours ago
          Isn't this the whole reason they became viable in the last 6 months? The system prompts and harnesses are improving. It's less and less essential every day to roll your own.
          • embedding-shape 1 hour ago
            I don't think there is a single reason. Models are improving, so are the harnesses and prompts, and we who use them a lot also get more proficient and learn where they can be used effectively vs. not. So: lots of improvements all over the ecosystem, brought together.

            The latest big change is probably how feasible local models are becoming, like Qwen 3.6 and Gemma 4; they're no longer easily getting stuck in loops and repetition, although at lower quantizations they still pretty much suck for agentic usage.

            • deadbabe 1 hour ago
              > we who use them a lot also get more proficient and learn where they can be used effectively vs not

              I think it’s always been obvious where an LLM could be used effectively and where it cannot, if you understand how they work and don’t see them as magical.

              The “increase in proficiency” is mostly people coming back to reality and being more intentional about LLM usage. There are no surprise discoveries here. One does not need to use an LLM a lot to get effective with them. A total noob could become effective on day 1 with proper guidance.

          • ekidd 1 hour ago
            The models also have far more intelligence built in. For example, the pi.dev agent harness has a system prompt which fits on a single page, and includes only 4 or 5 tools. Running with a small coding model like Qwen3.6 27B, this setup is completely capable of agentic coding.
          • bigstrat2003 1 hour ago
            They still aren't viable. Nothing changed within the last 6 months.
    • Kim_Bruning 2 hours ago
      There's a kid's game that illustrates this too: https://en.wikipedia.org/wiki/Telephone_game
      • embedding-shape 1 hour ago
        Maybe more relatable to the typical HN reader: you know when the top boss tells the lower bosses stuff, who then tell the bosses below them something, and once it reaches you as an IC it's all different and corrupted compared to what it initially was? LLMs have the same effect, unsurprisingly.
    • Twirrim 3 hours ago
      A coworker talks about LLMs as "bullshit" layers. Not exactly dismissing them or being derogatory about them, but emphasising that each time you feed something through an LLM, what comes out the other side may not be what you expect/want. Like that guy at the pub sharing what he'd seen online somewhere, after a few pints. Might be accurate, but carries notable risk it's not.

      So e.g., don't use an LLM to call an API to gather data and produce a report on it, as that's feeding deterministic data through a "bullshit" layer, meaning you can't trust what comes out the other side. Instead use the LLM to help you write the code that will produce a deterministic output from deterministic data.
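
      For example, rather than pasting API results into a chat and asking for a report, have the LLM write something like this once and then run it on the data directly (the field names here are made up for illustration):

        from collections import defaultdict

        # Deterministic report: same input data, same numbers, every time.
        # "region" and "amount" are illustrative field names.
        def report(records: list[dict]) -> dict:
            by_region: dict[str, float] = defaultdict(float)
            for r in records:
                by_region[r["region"]] += r["amount"]
            return {"grand_total": sum(by_region.values()),
                    "by_region": dict(by_region)}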

      I've seen co-workers use LLMs to summarise deterministic data coming from APIs and have reports be wildly off the mark as often as they are accurate. Depending on what they're looking at that can have catastrophic risk.

      • ben_w 3 hours ago
        Similar experience. I wouldn't say it even needs to be like some random person in the local pub: this behaviour is what you'd get from any game of telephone, book authors will say how you need to be blunt and direct about points in the book because readers will miss subtlety, anyone who has been quoted in a newspaper will have a story about the paper getting it wrong, etc.

        However, there's a reason pre-computing bureaucracy came with paper trails and meeting minutes getting written up, and why court cases are increasingly cautious about the reliability of eyewitnesses.

        It is ironic: the more AI becomes like us and the less it acts like a traditional computer program, the worse it is at many things we want to use it for. But because collectively we're oblivious to our cognitive limitations, we race into completely avoidable failures like this.

        • mpyne 2 hours ago
          > However, there's a reason pre-computing bureaucracy came with paper trails and meeting minutes getting written up, and why court cases are increasingly cautious about the reliability of eyewitnesses.

          This was the comment I was coming in to make: I worked in a pre-computing bureaucracy (the U.S. Navy's) and "staff you delegated work to have consistent trouble following the directions you provide for the delegated work" is just a fact of life.

          A lot of it is the telephone game, a lot of it is lack of real familiarity with office software, a lot of it is the inherent integration challenge of sending the same document out for coordination to dozens of stakeholders.

          All those mistakes you fixed based on comments on the draft that went out for O-6 review? At least 2 will pop up again at 1-star review, because staffers will copy the same text back out of the local copy they stashed during O-6 review rather than re-reviewing from scratch.

          Style guidance to meet the Admiral's preferred format? You can provide it, but there's not a chance they'll follow it; formatting is for humanities majors, so you'll need to catch and fix all that yourself.

          That's not to say the LLMs are foolproof or magically always correct, but a lot of these kinds of criticisms apply just as much, if not more, to the current status quo. I don't need LLMs to be perfect, I just need them to be better than the current alternatives.

      • giancarlostoro 2 hours ago
        Before Claude Code, my strategy in JetBrains AI was to start a new chat convo per task; it yielded better output.
      • glaslong 1 hour ago
        I like this framing. Or at least "nondeterministic" vs. "deterministic" layers for the folks who flinch at "bullshit." Also "broadly capable but lossy" versus "limited capability but reliable."

        When you build structures of dependencies, the interface between each pair seems to collapse to the lesser of the two. So there's a ton of work right now going into TLA+, structured IO, etc. to force even a bit of reliability back into the LLM/program boundaries, in order to have any hope of chaining multiple LLM dependencies in a stack without the whole thing toppling chaotically.

    • mrcartmeneses 32 minutes ago
      Further, could we think of intent as some ordered state, with the LLM introducing entropy over time, eventually resulting in something akin to free association?
    • ieieue 2 hours ago
      LLMs are the most elaborate guessing machine mankind has made. That makes them both useless and useful, depending on what they're used for.

      That’s it. Once you look at everything through this lens, everything makes sense - especially the fact that there is no underlying understanding, reasoning, or creativity. I don’t care what boosters say.

      • CamperBob2 2 hours ago
        I don't know what a "booster" is, but if a model can solve original math problems, then it's reasoning.

        If you can come up with a way to do math without reasoning, that would be, in a sense, even more interesting than AI.

        • oldsecondhand 10 minutes ago
          > If you can come up with a way to do math without reasoning, that would be, in a sense, even more interesting than AI.

          Logic is just syntactic manipulation of formulas. By the early 90s logical reasoning was pretty much solved with classical AI (the last building block being constraint logic programming).

        • figarus314 1 hour ago
          A model solving original math problems may look like human reasoning, but internally the model is choosing the next token based on what it has learned about probability around various patterns and structures. The model knows about correlations between problems, proof techniques and answer structures, and when it "reasons" it's selecting a high probability trajectory through that learned knowledge.

          A calculator is different because it is not probabilistic; it executes a fixed procedure. One of these models, when doing math, is more like a learned probabilistic system that understands enough structure around mathematics that some of its high probability trajectories seem like genuine reasoning.

          The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" - following an explicit rule enforcement. The model may produce the same output, and may even appear to approach it the same way, but the mechanism is a probabilistic pattern selection rather than explicit rule enforcement.

          • visarga 15 minutes ago
            You talk as if problem solving is a supervised (imitation) learning problem. No, it is a reinforcement learning problem: models learn by solving problems and getting rated. They generate their own training data. Optimal budget allocation is roughly 1/3 of cost on pre-training, 1/3 on RL, and 1/3 on inference.
          • XMPPwocky 49 minutes ago
            > The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" - following an explicit rule enforcement.

            How is this different from "probabilistic pattern selection"?

          • ieieue 54 minutes ago
            It’s amazing simple things have to be reiterated.

            Perhaps it’s best if most admit they don’t have the fundamental ways of thinking to even participate in the conversation.

            When all nuance is lost, the discussion must end.

          • senordevnyc 27 minutes ago
            I don’t think there’s any evidence that “human reasoning” isn’t also based on probabilistic pattern selection.
        • Terr_ 1 hour ago
          My dear sir, the entire universe is made of things that "do math without reasoning!"

          It's the default, and if we're lucky we harness pieces of it to discern something we're interested in.

          • ieieue 56 minutes ago
            Your post is enough to tell me you have never contributed an original insight in life.
    • Forgeties79 1 hour ago
      I was talking about this in a thread yesterday. It’s why I don’t like blogs that are just LLM generated. I don’t care how good you think it is, I don’t care that you consider a facsimile of you good enough. If I want a rote, boring LLM response, I will prompt it myself. I do not appreciate reading blogs and other supposedly human-generated content and having somebody attempt to trick me into reading their prompt results like some annoying middleman.

      I came to your blog to read what you had to say. Why are you writing a blog if you aren’t even going to write it?

    • threethirtytwo 3 hours ago
      A human doing the same task the LLM did in the paper would degrade the document further than the LLM does. If the LLM is at 25%, a human would probably degrade it 80% using the same technique the LLM used in this paper. I'm talking about a single pass.

      The fact of the matter is, humans don't edit things the way it was done in the paper, and neither do coding agents like Claude. Think about it: you do not ingest an entire paper and then regurgitate it with a single targeted edit... and neither do coding agents.

      Also think carefully: a 25% degradation rate is unacceptable in the industry. The AI shift that's taking over all of SWE development would not actually exist if there were 25% degradation... that's way too much.

      • lelanthran 3 hours ago
        Are we comparing humans to LLMs or human written software to LLMs?

        The whole point of creating software to do things used to be getting things done more accurately and consistently.

        • ACCount37 2 hours ago
          No. The whole point of creating software is getting things done.

          "More accurately and consistently" was merely downstream from what capabilities were natural for machine logic and hard algorithms.

          Now, we're just spoiled for choice. We have hard-algorithm software where we want to do things that benefit from accurate, consistent, highly deterministic behavior - and we have soft-algorithm AI for when we want to do things that simply aren't amenable to hard logic.

          Machine translation used to be a horrid mess when we were trying to do it with symbolic systems, because symbolic systems are "consistent, highly deterministic" but not at all "accurate" on translation tasks. Being able to leverage LLMs for that is a generational leap.

          • tommyage 15 minutes ago
            All software is hard-coded algorithms.

            If you distinguish between AI source code and engineer source code, say so. "Getting things done" is a business need. Which things get translated into a deterministic language executable by a computer is code.

            There are entire languages dedicated to letting lesser engineers/domain experts formulate business requirements.

            Anyhow: what's your point? That we received a framework for "soft algorithms" where the output does not need to be correct and deducible? What's even the point of putting it into software? Just forward your input to the reader and let them judge on their own.

      • RevEng 2 hours ago
        Except that coding agents will do this at times. That's half the problem. A human will forget details and exaggerate others, but LLMs fail in spectacular ways that humans rarely would, like trying to copy a document from memory rather than one word at a time, side by side, or rewriting the whole thing just to make some simple changes. Coding agents will delete tests or return True to get them to pass - something you would never expect of even a junior professional.

        And I know this because I see it all the time. I use composer-2 and sonnet 4.6 on a regular basis. It's not much better for my colleagues who use Opus or GPT or any of the other frontier models. Most of the time it's fine, but other times it does things simply unforgivable for a human. I have to watch the agent closely so that it doesn't decide to nuke my database; I don't have to do that with any of my juniors, even those with little experience and poor discipline.

        • xp84 2 hours ago
          > nuke

          > I don’t have to do that with any of my juniors…

          For some values of “nuke,” I absolutely have had to do that with juniors in the past. Perhaps you’re referring to a single rm -r or a hilarious force push or something, but undertrained and unsupervised juniors regularly introduce things like SQL injection, XSS, etc. simply because they don’t know any better yet. This isn’t saying “AI is better across the board” - I just don’t think they’re comparable, and I also think AI shouldn’t be used to chop the bottom 5 rungs off our career ladder. But let’s not pretend juniors can be left alone with a codebase without any worries.

  • simonw 3 hours ago
    I'm suspicious of their results with regards to tool usage.

    It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.

    They claim that tool use didn't help, which surprised me... but they also said:

    > To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.

    And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!

    The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...

    The str_replace and insert commands are essential for avoiding round-trip risky edits of the whole file.
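
    The core idea is easy to sketch (this is my illustration of the idea, not Claude's actual implementation): the edit must match exactly once or fail, and the rest of the file is never regenerated by the model:

      from pathlib import Path

      def str_replace(path: str, old: str, new: str) -> None:
          # Surgical edit: `old` must occur exactly once; anything else is
          # an error and the model has to retry with more surrounding
          # context. The untouched rest of the file is never round-tripped
          # through the model.
          text = Path(path).read_text()
          n = text.count(old)
          if n != 1:
              raise ValueError(f"expected exactly 1 match, found {n}")
          Path(path).write_text(text.replace(old, new, 1))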

    They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and if it encouraged Python-based manipulation over reading and then writing the file.

    Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...

    The relevant prompt fragment is:

      You can approach the task in whatever
      way you find most effective:
      programmatically or directly
      by writing files
    
    As with so many papers like this, the results of the paper reflect more on the design of the harness that the paper's authors used than on the models themselves.

    I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.

    • alansaber 11 minutes ago
      The incomprehensible methodology - whether due to resource constraints or straight up for simplicity's sake - makes these papers worthless, unfortunately.
    • threethirtytwo 3 hours ago
      People love to interpret the results in the most negative way possible because it's a threat to their occupation and identity. I refer to HN specifically.

      The fact of the matter is, if you want to edit a document by reading it and then regurgitating the entire document with said edits... a human will DO worse than a 25% degradation. It's possible for a human to achieve 0% degradation, but the human would have to ingest the document hundreds of times to reach a state called "memorization". The equivalent in an LLM is called training. If you train a document into an LLM, you can get parity with the memorized human edit in this case.

      But the above is irrelevant. The point is LLMs have certain similarities with humans. You need to design a harness such that an LLM edits a document the same way a human would: Search and surgical edits. All coding agents edit this way, so this paper isn't relevant.

  • buffaloPizzaBoy 44 minutes ago
    I typically tell my agents to treat document writing only as a last "rendering" pass. LLMs are so good at taking sparse knowledge and compiling it that I prefer to store knowledge as composable ideas/facts.

    What has worked well in practice is giving the agent a directory and telling it to make independent markdown files for facts/findings it locates - with each file having front-matter for easy searchability.
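
    For example, a single fact file might look like this (the front-matter fields and the fact itself are purely illustrative, use whatever your search tooling wants):

      ---
      topic: vendor-api-rate-limits
      tags: [external-api, constraints]
      status: verified-against-docs
      ---
      The vendor API allows 100 requests/minute per key; bursts above
      that return HTTP 429 with a Retry-After header.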

    This de-complects most tasks from "research AND store iteratively in a final document format" into more cohesive tasks: "research a set of facts and findings which may be helpful for a document" and "assemble the document".

    Only a partial mitigation, but I find it leads to more versatile re-use of findings, same as if a human were working.

  • causal 5 hours ago
    Yeah I've been saying this for a while: AI-washing any text will degrade it, compounding with each pass.

    "Semantic ablation" is my favorite term for it: https://www.theregister.com/software/2026/02/16/semantic-abl...

    • mohamedkoubaa 4 hours ago
      I've been calling it meanwit reversion
    • polskibus 4 hours ago
      By „with each pass” do you mean within the same session, or with a new session (context window) each time?
      • sebastiennight 4 hours ago
        In my experience, it happens with each edit of the document, whether or not you clear the context window.

        You can somewhat mitigate this, at the same moment you ask for the new edit, by adding new info or specifying the lost meaning you want to add back. But other things will still get washed out.

        Nuances will drift, sharp corners will be ablated. You're doing a Xerox copy of your latest Xerox copy, so even if you add your comments with a sharpie, anything that was there right before will be slightly blurrier in the next version.

      • adampunk 4 hours ago
        Each edit, even with unrelated edits. I had a README referring to something as "the cathedral of s*t" (some HN commentators don't care for the swearing, which is systemically bad news but w/e) and the robot would lift that phrase out in drive-bys, repeatedly.

        Occasionally it would report the action, sometimes it would not bother to report it. It never reached into the README on an unrelated doc edit, but if it was touching the README, that line was getting excised.

  • wtetzner 2 hours ago
    I think the problem is that we're using LLMs to do too much of the work. We should aim to design agents that use the LLM as the thinnest possible layer to translate the natural language intent into a deterministic process, minimizing round trips to the LLM as much as possible.
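
    A minimal sketch of what I mean, where call_llm() is a stand-in for whatever model client you use and the command schema is illustrative: the LLM's only job is to emit one tiny structured command, and everything after that is deterministic code.

      import json

      ALLOWED_OPS = {"replace", "append", "delete"}

      def call_llm(prompt: str) -> str:
          raise NotImplementedError  # stand-in for your model client

      def translate_intent(instruction: str) -> dict:
          # One short round trip: natural language in, small JSON command out.
          raw = call_llm(
              "Reply with only JSON having keys 'op' (replace|append|delete),"
              " 'target', and 'text', for this request: " + instruction
          )
          cmd = json.loads(raw)
          if cmd["op"] not in ALLOWED_OPS:
              raise ValueError(f"unknown op: {cmd['op']}")
          return cmd  # from here on, plain deterministic code applies it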
  • rmwaite 1 hour ago
    What I find fascinating about LLMs is that a lot of their failures seem strikingly similar to the failures that humans struggle with. I’m not sure what this “means” but I think it’s interesting that we can theoretically fix these failures for LLMs but for humans it is much harder. You pretty much need to educate / indoctrinate people for their entire lives and even then it’s messy and unpredictable and prone to failure—just like LLMs.
  • meander_water 3 hours ago
    > We find that models are not failing due to “death by a thousand cuts” (i.e., many small errors). Instead, they maintain near-perfect reconstruction in some rounds, and experience critical failures in a few rounds, typically losing 10-30+ points in a single round trip

    > We find that weaker models’ degradation originates primarily from content deletion, while frontier models’ degradation is attributable to corruption of content.

    I think we largely already knew this. This is why we fudge around with harnesses and temperature etc.

  • jonmoore 5 hours ago
    I really liked the evaluation method here - testing fidelity by round-tripping through chains of invertible steps. It was striking how even frontier models accumulated errors on seemingly computer-friendly tasks.

    It would be interesting to know if the stronger results on Python are not just an artefact of the Python-specific evaluation, if they carry over to other common general-purpose languages, and if they are driven by something specific in the training processes.

  • rao-v 10 minutes ago
    May your contexts always be short
  • andrewljohnson 2 hours ago
    LLM editing should be done to produce deterministic output.

    That is, the LLM should produce a diff, and the user should accept the diff. It seems like a bad pattern to just tell the LLM to edit any long document without that sort of visibility. Same goes for prose as for code.
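
    A rough sketch of the pattern, assuming the model has been prompted to emit a unified diff (patch(1) is the standard Unix tool; nothing here is model-specific):

      import subprocess, tempfile

      def apply_llm_diff(diff_text: str, workdir: str) -> None:
          # Treat the model's output as data, not as the new file contents:
          # `patch` only touches the hunks the diff names.
          with tempfile.NamedTemporaryFile("w", suffix=".patch") as f:
              f.write(diff_text)
              f.flush()
              # Dry run first: a hunk that doesn't apply cleanly is rejected
              # outright rather than silently corrupting the file.
              subprocess.run(["patch", "-p1", "--dry-run", "-i", f.name],
                             cwd=workdir, check=True)
              subprocess.run(["patch", "-p1", "-i", f.name],
                             cwd=workdir, check=True)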

    • alansaber 8 minutes ago
      This gets skipped because continual approvals break up user flow, so we let LLMs make a few-hundred-line diff, then the user does a bulk review and can revert all or part of it. It's naive to assume the user will review every LOC in every instance.
  • danielvaughn 2 hours ago
    I've spent the last few months reading a lot of AI-generated code. It's extremely difficult.

    It's like how psychopaths are eerie because there's nothing behind their eyes. AI-generated code is eerie because there's nothing between the lines. Code is in some sense theory building, and when you read a human's code you can (mostly) feel their theory working in the background. LLMs have no such theory; the code is just facts strewn about. It's a very weird experience to try to understand it.

    • leptons 13 minutes ago
      My company is moving to a workflow where we only write Jira tickets, the LLM writes all the code and submits a PR. Then we are supposed to review the code the LLM wrote.

      I'm looking for a new job.

    • glaslong 1 hour ago
      Thank you. I've had trouble articulating this sense, but it's strong. An uncanny valley.
  • twobitshifter 2 hours ago
    I thought this was going to be about a problem we saw recently. Someone used an LLM to update the comment block at the start of each source file, and the LLM wrote its own tool that ended up changing ALL of the line endings when it wrote the files back out with the corrected comment block. Instead of an LLM we could have used find-and-replace, but people are treating the LLM as the only tool.
  • woeirua 4 hours ago
    It's an interesting paper, but I'd like to see a lot more about the types of errors that the LLM makes. Are they happening in the forward pass or the inverse pass? My guess is the inverse pass.
  • bigstrat2003 1 hour ago
    We don't need a study to tell us that LLMs always make mistakes. We already knew that. Anyone with sense is not using LLMs because of that.
  • carterschonwald 3 hours ago
    this is literally just “leave a child at the work computer with a real doc open playing office”. otoh it is good to design benchmarks to ground these things.

    on the flip side if you’re literally just using a bare bones harness on top of a stochastic parrot, of course stochastic errors accumulate.

    there's a lot of ways for improving text faithfulness through harness tool designs, and my incremental experiments seem promising.

    but unless work is gated on shit like “the script used must be type-checked ghc haskell or lean4”, unsupervised stuff is gonna decay

  • adampunk 4 hours ago
    LLMs will make mistakes on every turn. The mistakes will have little to no apparent connection to "difficulty" or what may or may not be prevalent in the training data. They will be mistakes at all levels of operation, from planning to code writing to reporting. Whether those mistakes matter and whether you catch them is mostly up to you.

    I have yet to find a model that does not make mistakes each turn. I suspect that this kind of error is fundamentally incorrigible.

    The most interesting thing about LLMs is that despite the above (and its non-determinism) they're still useful.

    • simonw 3 hours ago
      > I have yet to find a model that does not make mistakes each turn

      What kind of mistakes are you talking about here?

    • pyrolistical 4 hours ago
      As a human I make typos all the time
      • leptons 6 minutes ago
        The LLM makes typos for me all the time using AI autocomplete. It's caused a lot of frustration while coding, because it makes mistakes that I would not. When it does help, it's great, but the errors waste as much time as the LLM saves me. Even using agentic coding, AI is mostly break-even for me.
      • dangus 3 hours ago
        A human can sit down and say “I’m going to make sure this is correct on the first pass and make sure I make an exact copy.”

        They have cognitive awareness of which tasks are highly critical and need more checking and re-checking without being prompted to think that way.

        For a human, time doesn’t stop when the first pass of the prompt and response is over. An LLM effectively wipes its memory of what it just did unless something is keeping track of a highly resource-constrained context.

        An LLM is like an author of a book that immediately closes its eyes and wipes its memory after writing a chapter. Sure, it can pull some of that back in the next query via context, and it can regain context very quickly, but it effectively has no memory of the exact thing it just did.

        When a human is doing these tasks there is a lot of room for mistakes but there’s also a wildly higher capacity for flowing through time.

        • adampunk 3 hours ago
          Ok, and?
          • simonh 3 hours ago
            Humans understand what mistakes are and can reason about what constitutes a mistake and what doesn’t. LLMs can’t do that.

            It’s for the same reason that they will invent bullshit instead of saying “I don’t know”, when they don’t know. They don’t have a concept of accuracy of facts.

          • dangus 3 hours ago
            And that’s why I’m paid six figures and my LLM is paid $20/month.
      • adampunk 4 hours ago
        I do too! I also make higher level design errors and get too enthusiastic about projects before code is written.

        We are, in a sense, fallible machines who have designed a planet-wide computational fabric around that fact.

  • cyanydeez 5 hours ago
    I played around with a local LLM to try to build a wiki-like DAG. It made a lot of stupid errors, from vague generic things like interpreting content based on file names, to not following redirects and placing the redirect response in the files.

    I've also had them convert to markdown something like an Excel-formatted document. It worked pretty well as long as I was examining the output. But the longer it ran in context, the more likely it was to slip things in that seemed related but weren't part of the breakdown.

    The only way I've found to mitigate some of it is to make every file a small, purpose-built doc. That way you can use git to revert changes, but you also limit the damage every time they touch a file to that small context.

    Anyone who thinks they're a genius creating docs or updating them isn't actually reading the output.

    • sebastiennight 4 hours ago
      > I've also had them convert to markdown something like an Excel-formatted document.

      This looks like a task where the LLM would be best used to write a deterministic script or program that then does the conversion.

      Trusting an LLM to make the change without tools is like telling the smartest person you know to just recite the converted document out loud from memory. At some point they'll get distracted, get things wrong, or unwittingly inject their own biases and ideas whenever the source data is counter-intuitive to them.

      • trollbridge 3 hours ago
        I see people cut and paste from Excel into a chat, as an image, and ask it to sum up numbers.
        • somewhatgoated 2 hours ago
          I’ve seen people drink their own recycled piss and inject coffee into their ass - what’s your point?
          • sebastiennight 30 minutes ago
            In the first half, I thought you were an astronaut, but the second half has me double-guessing myself.
      • cyanydeez 3 hours ago
        It was, but the formatting was garbage, so it ran again to fix the format.
  • threethirtytwo 3 hours ago
    This experiment needs to be put in perspective. Let me explain: if you did this SAME experiment with a human, and had the human read an ENTIRE document and then reproduce it with edits, the DOCUMENT would DEGRADE even more.

    The way this experiment is conducted is not in line with how current agentic AI is used OR how humans edit documents.

    Here's how agentic AI currently typically do edits:

    1. They read the whole document.
    2. They come up with a patch: a diff of the section they want to edit.
    3. They change THAT section only.

    This is NOT what the experiment was doing. A 25% degradation rate would render the whole industry dead. No one would be using Claude Code if that were the case. The reality is... everyone is using Claude Code.

    AI is alien to the human brain, but in many ways it is remarkably similar. This is one aspect of similarity: we cannot edit a whole document holistically to produce one edit. It has to be targeted surgical edits rather than a regurgitation of the entire document with said edit.
